
I have a df with many columns that represent the market cap of companies that compose an index. The index of the dataframe is dates. Before the company enters the index or after it leaves it, the market cap of the company in the df is zero.

I want to know the mean number of days in which the index doesn't change.

For example:

[in]: 
df = pd.DataFrame(np.array([[1, 1, np.nan], [np.nan, 2, 10], [1, 3, 100], [4, np.nan, 100]]),
                  columns=['a', 'b', 'c'])

df
[out]:  
        a   b   c    
    0   1.0 1.0 NaN    
    1   NaN 2.0 10.0    
    2   1.0 3.0 100.0    
    3   4.0 NaN 100.0

What I want to know is how many rows have entries from the same set of columns.

For example, row 0 has entries from columns a and b, row 1 from columns b and c, row 2 from columns a, b and c, and row 3 from columns a and c. Therefore there are 4 rows with unique column combinations and 4 changes. The mean is then 1, as sketched below.
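
To make the expected result concrete, here is a rough sketch of the counting I have in mind (illustrative only, not a working solution I already have):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan], [np.nan, 2, 10], [1, 3, 100], [4, np.nan, 100]],
                  columns=['a', 'b', 'c'])

# One boolean pattern per row: which columns have an entry (are not NaN).
patterns = df.notna().apply(tuple, axis=1)
print(patterns.nunique())            # 4 unique column combinations
print(len(df) / patterns.nunique())  # 4 rows / 4 combinations -> mean of 1.0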

bbarroso
3 Answers


By building a tuple of whether each value is NaN or not, you can collect the unique patterns in a set:

Code:

import numpy as np
import pandas as pd

def get_average_run_length(a_df):
    # One tuple per row marking which columns hold a (non-NaN) value.
    unique_patterns = set(
        tuple(not np.isnan(x) for x in row[1]) for row in a_df.iterrows())
    return len(a_df) / len(unique_patterns)

Test Code:

df = pd.DataFrame(
    [[1, 1, np.nan], [np.nan, 2, 10], [1, 3, 100], [4, np.nan, 100]],
    columns=list('abc'))

print(df)
print(get_average_run_length(df))

# double the last row
df = pd.concat([df, df.tail(1)])
print(df)
print(get_average_run_length(df))

Results:

     a    b      c
0  1.0  1.0    NaN
1  NaN  2.0   10.0
2  1.0  3.0  100.0
3  4.0  NaN  100.0

1.0

     a    b      c
0  1.0  1.0    NaN
1  NaN  2.0   10.0
2  1.0  3.0  100.0
3  4.0  NaN  100.0
3  4.0  NaN  100.0

1.25
Stephen Rauch

I am not entirely sure if I understand your question correctly. What result do you want for the following dataframe: 2/5 because there are 2 different NaN-combinations in 5 rows, or 4/5 because there are 4 changes in the index composition in 5 rows?

    a   b
0   1   NaN
1   NaN 2
2   3   NaN
3   NaN 4
4   5   NaN

Stephen Rauch's answer covers the first scenario. This one should cover the second:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, np.nan],
                            [3, 4, np.nan],
                            [6, np.nan, 7],
                            [8, np.nan, np.nan]]))

# NaN masks of consecutive rows: every row except the last vs. every row except the first.
df_na = np.array(df.isna()[:-1])
df_na_shifted = np.array(df.isna()[1:])

# For each pair of consecutive rows, count the columns whose NaN status is unchanged.
number_of_good_cols = (df_na == df_na_shifted).sum(axis=1)

# A pair is "good" when every column keeps its status, i.e. the composition did not change.
number_of_good_rows = sum(number_of_good_cols == df.shape[1])
print('The index changed on %d days.' % (df.shape[0] - 1 - number_of_good_rows))
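
If what you ultimately want is the mean number of consecutive days without a change (the second reading above), one possible follow-up, reusing the variables from the snippet, could be:

# Number of consecutive-row pairs where the composition changed.
number_of_changes = df.shape[0] - 1 - number_of_good_rows

# Each change starts a new unchanged stretch, so there are (changes + 1) stretches.
mean_run_length = df.shape[0] / (number_of_changes + 1)
print('Mean number of rows per unchanged stretch: %.2f' % mean_run_length)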
Elias Schoof

Since the core of the algorithm seems to be counting the unique combinations of columns that don't evaluate to NaN in a given row, I'd like to suggest a more one-liner-ish solution:

# Encode each row's NaN pattern as a string of 0s and 1s, then count the distinct patterns.
unique_col_num = df.notna().astype(int).astype(str).sum(axis=1).unique().shape[0]
average_n_same_col = df.shape[0] / unique_col_num

The average number of rows sharing the same column selection is then "average_n_same_col".
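
As a quick check against the example frame from the question (this run is only illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan], [np.nan, 2, 10], [1, 3, 100], [4, np.nan, 100]],
                  columns=['a', 'b', 'c'])

unique_col_num = df.notna().astype(int).astype(str).sum(axis=1).unique().shape[0]
average_n_same_col = df.shape[0] / unique_col_num
print(average_n_same_col)  # 4 rows / 4 unique patterns -> 1.0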

petr