10

Assume that we have the following pandas dataframe:

df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

input:
 col1 col2  start
0  A>G  TCT   1000
1  C>T  ACA   2000
2  C>T  TCA   3000
3  G>T  TCA   4000
4  C>T  GCT   5000
5  A>G  ACT   6000
6  A>G  CTG  10000
7  A>G  ATG  20000
8  C>A  TCT  10000
9  C>T  ACA   2000
10 C>T  TCA   3000
11 C>T  TCA   4000

What I want to get is the number of consecutive values in col1 and length of these consecutive values and the difference between the last element's start and first element's start:

output:
 type length  diff
0  C>T  2   1000
1  A>G  3   14000
2  C>T  3   2000
burcak
  • 203
  • 1
  • 2
  • 4

1 Answers1

12

Break col1 into sub-groups of consecutive strings. Extract first and last entry per sub-group.

Something like this:

df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

df['subgroup'] = (df['col1'] != df['col1'].shift(1)).cumsum()

col1 col2 start subgroup 0 A>G TCT 1000 1 1 C>T ACA 2000 2 2 C>T TCA 3000 2 3 G>T TCA 4000 3 4 C>T GCT 5000 4 5 A>G ACT 6000 5 6 A>G CTG 10000 5 7 A>G ATG 20000 5

df.groupby('subgroup',as_index=False).apply(lambda x: (x['col1'].head(1), x.shape[0], x['start'].iloc[-1] - x['start'].iloc[0]))

0 ([A>G], 1, 0) 1 ([C>T], 2, 1000) 2 ([G>T], 1, 0) 3 ([C>T], 1, 0) 4 ([A>G], 3, 14000)

Tweak as needed.

UPDATE: for pandas 1.1+ replace the last part with:

def func(x):
    result = {"type":x['col1'].head(1).values[0], "length": x.shape[0], "diff": x['start'].iloc[-1] - x['start'].iloc[0]}
    return pd.Series(result, name="index")
df.groupby('subgroup',as_index=False).apply(func)
oW_
  • 6,502
  • 4
  • 29
  • 47