
I have a dataset like this:

Sample Dataframe

import pandas as pd

df = pd.DataFrame({
    'names': ['A','B','C','D','E','F','G','H','I','J','K','L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]})

I'd like to replace the 0's in col1 and col2 with 1's, except where three or more 0's are consecutive in the same column; those runs should stay as 0's. How can this be done with pandas?

Original Dataset:

names   col1    col2
A   0   0
B   1   0
C   0   0
D   1   0
E   1   1
F   1   0
G   0   1
H   0   0
I   0   1
J   1   0
K   0   0
L   0   0

Desired Dataset:

names   col1    col2
A   1   0
B   1   0
C   1   0
D   1   0
E   1   1
F   1   1
G   0   1
H   0   1
I   0   1
J   1   0
K   1   0
L   1   0
Kevin

3 Answers


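Consider the following approach. It assumes names has already been set as the index (df = df.set_index('names')), as in the output below:

def f(col, threshold=3):
    # Label each run of identical consecutive values, then get each run's length
    run_len = col.groupby((col != col.shift()).cumsum()).transform('count')
    # Select only the zeros that sit in runs shorter than the threshold
    mask = run_len.lt(threshold) & col.eq(0)
    # Return a new Series with those zeros set to 1 (the input is not modified)
    return col.mask(mask, 1)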

In [79]: df.apply(f, threshold=3)
Out[79]:
       col1  col2
names
A         1     0
B         1     0
C         1     0
D         1     0
E         1     1
F         1     1
G         0     1
H         0     1
I         0     1
J         1     0
K         1     0
L         1     0

Step by step:

In [84]: col = df['col2']

In [85]: col
Out[85]:
names
A    0
B    0
C    0
D    0
E    1
F    0
G    1
H    0
I    1
J    0
K    0
L    0
Name: col2, dtype: int64

In [86]: (col != col.shift()).cumsum()
Out[86]:
names
A    1
B    1
C    1
D    1
E    2
F    3
G    4
H    5
I    6
J    7
K    7
L    7
Name: col2, dtype: int32

In [87]: col.groupby((col != col.shift()).cumsum()).transform('count')
Out[87]:
names
A    4
B    4
C    4
D    4
E    1
F    1
G    1
H    1
I    1
J    3
K    3
L    3
Name: col2, dtype: int64

In [88]: col.groupby((col != col.shift()).cumsum()).transform('count').lt(3)
Out[88]:
names
A    False
B    False
C    False
D    False
E     True
F     True
G     True
H     True
I     True
J    False
K    False
L    False
Name: col2, dtype: bool

In [89]: col.groupby((col != col.shift()).cumsum()).transform('count').lt(3) & col.eq(0)
Out[89]:
names
A    False
B    False
C    False
D    False
E    False
F     True
G    False
H     True
I    False
J    False
K    False
L    False
Name: col2, dtype: bool
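
The walk-through above stops at the boolean mask. As a minimal sketch of the remaining step, the snippet below rebuilds col2 on its own and applies the mask with Series.mask; the variable names run_id, run_len and filled are only illustrative.

```python
import pandas as pd

col = pd.Series([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
                index=list('ABCDEFGHIJKL'), name='col2')

# Label each run of equal consecutive values, then measure each run's length.
run_id = (col != col.shift()).cumsum()
run_len = col.groupby(run_id).transform('count')

# Zeros that sit in runs shorter than 3 are the ones to replace.
mask = run_len.lt(3) & col.eq(0)

# Series.mask returns a new Series holding 1 wherever the mask is True.
filled = col.mask(mask, 1)
print(filled.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
```

Only F and H are flagged, so the leading run of four zeros and the trailing run of three are left untouched.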

You can use pandas.Series.shift() to find the pattern you need.

Code:

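def fill_zero_not_3(series):
    zeros = (True, True, True)
    # For each row, flag which values in the 5-wide window centred on it are 0
    runs = [tuple(x == 0 for x in r)
            for r in zip(*(series.shift(i)
                           for i in (-2, -1, 0, 1, 2)))]
    # Fill a row unless one of the three 3-wide windows containing it is all 0
    need_fill = [(r[0:3] != zeros and r[1:4] != zeros and r[2:5] != zeros)
                 for r in runs]
    retval = series.copy()
    retval[need_fill] = 1
    return retval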

Test Code:

import pandas as pd

df = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]}).set_index('names')

df['col1'] = fill_zero_not_3(df['col1'])
df['col2'] = fill_zero_not_3(df['col2'])
print(df)

Results:

       col1  col2
names            
A         1     0
B         1     0
C         1     0
D         1     0
E         1     1
F         1     1
G         0     1
H         0     1
I         0     1
J         1     0
K         1     0
L         1     0
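
The same shift-based idea can also be written with whole-column boolean operations instead of per-row tuples. The sketch below is one possible variant, not the function above: fill_short_zero_runs is a hypothetical helper name, and it assumes a pandas version in which Series.shift accepts fill_value. Like the original, it is hard-wired to runs of three.

```python
import pandas as pd

def fill_short_zero_runs(series):
    """Hypothetical helper: set zeros to 1 unless they lie in a run of 3+ zeros."""
    z = series.eq(0)
    # Neighbouring is-zero flags; positions past either end count as "not zero".
    prev1 = z.shift(1, fill_value=False)
    prev2 = z.shift(2, fill_value=False)
    next1 = z.shift(-1, fill_value=False)
    next2 = z.shift(-2, fill_value=False)
    # A zero belongs to a long run if any 3-wide window containing it is all zeros.
    in_long_run = (z & prev1 & prev2) | (z & prev1 & next1) | (z & next1 & next2)
    return series.mask(z & ~in_long_run, 1)

col1 = pd.Series([0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0])
col2 = pd.Series([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0])
print(fill_short_zero_runs(col1).tolist())  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
print(fill_short_zero_runs(col2).tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
```

Whether this is faster on a given dataset has not been measured here; it only avoids the Python-level loop over rows.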
Stephen Rauch

@Stephen Rauch's answer is very smart, but it was slow when I applied it to a large dataset. Inspired by this post, I think I found a more efficient way to achieve the same goal.

The code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'],
    'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]}).set_index('names')

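for i in range(df.shape[1]):
    # Pad the is-zero flags with 0 at both ends so runs at the edges get boundaries too
    iszero = np.concatenate(([0], np.equal(df.iloc[:, i].values, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Each row of zerorange holds the (start, stop) positions of one run of zeros
    zerorange = np.where(absdiff == 1)[0].reshape(-1, 2)
    for j in range(len(zerorange)):
        # Only runs shorter than 3 are filled with 1
        if zerorange[j][1] - zerorange[j][0] < 3:
            df.iloc[zerorange[j][0]:zerorange[j][1], i] = 1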
print(df)

Results:

       col1  col2
names            
A         1     0
B         1     0
C         1     0
D         1     0
E         1     1
F         1     1
G         0     1
H         0     1
I         0     1
J         1     0
K         1     0
L         1     0
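
To make the boundary trick in the loop easier to follow, here is a small sketch that runs it on the col2 values alone as a plain NumPy array. The names edges, runs, lengths and filled are only illustrative, and np.flatnonzero is used in place of the abs/where pair.

```python
import numpy as np

values = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0])  # col2 from the question

# Pad with a 0 on both sides so runs touching either end still get a boundary.
iszero = np.concatenate(([0], (values == 0).astype(np.int8), [0]))

# A nonzero difference marks a position where a run of zeros starts or stops.
edges = np.flatnonzero(np.diff(iszero))

# Consecutive edges pair up as (start, stop), with stop exclusive.
runs = edges.reshape(-1, 2)
lengths = runs[:, 1] - runs[:, 0]
print(runs.tolist())     # [[0, 4], [5, 6], [7, 8], [9, 12]]
print(lengths.tolist())  # [4, 1, 1, 3]

# Fill only the runs shorter than 3.
filled = values.copy()
for start, stop in runs[lengths < 3]:
    filled[start:stop] = 1
print(filled.tolist())   # [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
```

The inner loop then runs once per run of zeros rather than once per row.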
Kevin