0

I am using pandas and am dealing with time series of sales.

What I would like to do is to remove the columns where a certain number of consecutive zeros appear, since forecasting for sparse series or repeated zero values tend to be unreliable. My dataframe has a bit more than 4,400 time series for a time span of 5 years.

I have tried applying this strategy from a previous SO question, but I cannot drop the columns fulfilling the condition.

Below is an example where the threshold would be 12. Any column having 12 or more consecutive zeros would be discarded.

Input:

      A   B   C

0 1 1 8 1 2 0 14 2 3 0 20 3 0 0 15 4 0 0 23 5 0 0 25 6 0 0 22 7 0 0 18 8 0 0 16 9 0 0 14 10 0 0 8 11 1 0 10 12 18 0 7 13 34 0 4 14 110 1 14 15 11 30 2 16 0 24 8 17 0 11 7 18 1 22 11 19 0 33 3 20 0 90 12 21 12 32 19 22 11 90 17 23 77 13 2

Desired output:

      A   C

0 1 8 1 2 14 2 3 20 3 0 15 4 0 23 5 0 25 6 0 22 7 0 18 8 0 16 9 0 14 10 0 8 11 1 10 12 18 7 13 34 4 14 110 14 15 11 2 16 0 8 17 0 7 18 1 11 19 0 3 20 0 12 21 12 19 22 11 17 23 77 2

Thanks.

Ethan
  • 1,657
  • 9
  • 25
  • 39
Nathan Furnal
  • 275
  • 1
  • 3
  • 10

4 Answers4

0
def max0(sr):
    return (sr != 0).cumsum().value_counts().max() - (0 if (sr != 0).cumsum().value_counts().idxmax()==0 else 1)
max0(pd.Series([0,0,0,0]))

Output:
> 4
max0(pd.Series([1,0,0,0,0,2,3]))
Output:
> 4
max0(pd.Series([0]))
Output:

> 1
max0(pd.Series([1]))
Output:

> 0

fuwiak
  • 1,373
  • 8
  • 14
  • 26
RELW
  • 1
0

You can try rolling().sum:

thresh = 12

to_drop = df.eq(0).rolling(thresh).sum().eq(thresh).any() df.loc[:, ~to_drop]

Output:

      A   C
0     1   8
1     2  14
2     3  20
3     0  15
4     0  23
5     0  25
6     0  22
7     0  18
8     0  16
9     0  14
10    0   8
11    1  10
12   18   7
13   34   4
14  110  14
15   11   2
16    0   8
17    0   7
18    1  11
19    0   3
20    0  12
21   12  19
22   11  17
23   77   2
Quang Hoang
  • 131
  • 2
0

An alternative to archived what you are looking for is:

threshold = 12

drops = [l[0] for l in list(filter(lambda x: x[1] > threshold,[(col, (df.groupby((df[col] != 0).cumsum()).cumcount()).max()) for col in df.columns]))]

df.drop(drops, axis = 1)

Outputs:

    A   C
0   1   8
1   2   14
2   3   20
3   0   15
4   0   23
5   0   25
6   0   22
7   0   18
8   0   16
9   0   14
10  0   8
11  1   10
12  18  7
13  34  4
14  110 14
15  11  2
16  0   8
17  0   7
18  1   11
19  0   3
20  0   12
21  12  19
22  11  17
23  77  2
Multivac
  • 3,199
  • 2
  • 10
  • 26
-1

Short answer:

# Max number of zeros in a row
threshold = 12

# 1. transform the column to boolean is_zero
# 2. calculate the cumulative sum to get the number of cumulative 0
# 3. Check how much of each count you get and remove 0 counts
# 4. Get the maximum number of cumulative zeros
# 6. Drop the columns with more than your threshold

to_remove = []
for col in df.columns:
    count_zeros = (df[col] == 0).cumsum().value_counts().drop(0)
    if len(count_zeros) and count_zeros.idxmax() > threshold:
        to_remove.append(col)

df = df.drop(to_remove, axis=1)

You're welcome

qmeeus
  • 1,389
  • 1
  • 11
  • 13