How to find the count of consecutive same string values in a pandas dataframe?
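Suppose we have the following pandas DataFrame:
import pandas as pd
df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G', 'C>A', 'C>T', 'C>T', 'C>T'],
    'col2': ['TCT', 'ACA', 'TCA', 'TCA', 'GCT', 'ACT', 'CTG', 'ATG', 'TCT', 'ACA', 'TCA', 'TCA'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, 10000, 2000, 3000, 4000]})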
Input:
   col1 col2  start
0   A>G  TCT   1000
1   C>T  ACA   2000
2   C>T  TCA   3000
3   G>T  TCA   4000
4   C>T  GCT   5000
5   A>G  ACT   6000
6   A>G  CTG  10000
7   A>G  ATG  20000
8   C>A  TCT  10000
9   C>T  ACA   2000
10  C>T  TCA   3000
11  C>T  TCA   4000
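What I want is one row for each run of consecutive identical values in col1, giving the value, the length of the run, and the difference between the start of the run's last element and the start of its first element:
Output:
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000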
With a little setup, you can vectorize this with GroupBy.agg:
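# each (name, function) tuple names one output column
aggfunc = {
    'col1': [('type', 'first'), ('length', 'count')],
    'start': [('diff', lambda x: abs(x.iat[-1] - x.iat[0]))]
}
# a new label starts wherever col1 differs from the previous row
grouper = df.col1.ne(df.col1.shift()).cumsum()
v = df.assign(key=grouper).groupby('key').agg(aggfunc)
v.columns = v.columns.droplevel(0)
# runs of a single row have diff == 0 and are dropped
v[v['diff'].ne(0)].reset_index(drop=True)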
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
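To make the grouping key in the answer above easier to follow, here is a minimal sketch that prints the run labels and then repeats the aggregation with named aggregation (available since pandas 0.25). It is a variant, not the answerer's exact code: the names `key`, `runs`, `first_start` and `last_start` are illustrative, the key is renamed to `run` to avoid clashing with the `col1` column, and single-row runs are filtered on `length` rather than on `diff`.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G',
             'C>A', 'C>T', 'C>T', 'C>T'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000,
              10000, 2000, 3000, 4000],
})

# True wherever a row differs from the previous one; the running sum
# of those flags gives every run of equal values its own integer label.
key = df['col1'].ne(df['col1'].shift()).cumsum().rename('run')
print(key.tolist())  # [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7, 7]

# Named aggregation: output_column=(input_column, function)
runs = df.groupby(key).agg(
    type=('col1', 'first'),
    length=('col1', 'size'),
    first_start=('start', 'first'),
    last_start=('start', 'last'),
)
runs['diff'] = (runs['last_start'] - runs['first_start']).abs()

# Keep only runs of two or more rows
result = (runs.loc[runs['length'] > 1, ['type', 'length', 'diff']]
              .reset_index(drop=True))
print(result)
```

Filtering on `length > 1` also keeps a run whose first and last `start` happen to be equal, which a `diff != 0` filter would drop.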
Perhaps something like the following:
import pandas as pd
from itertools import groupby

df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G', 'C>T', 'C>T', 'C>T'],
    'col2': ['TCT', 'ACA', 'TCA', 'TCA', 'GCT', 'ACT', 'CTG', 'ATG', 'ACA', 'TCA', 'TCA'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000, 2000, 3000, 4000]})

final = []
pos = 0  # position of the first row of the current run
for k, g in groupby(df['col1']):
    glist = list(g)
    first_pos = pos
    last_pos = pos + len(glist) - 1
    if len(glist) > 1:  # keep only runs of two or more rows
        first = df.iloc[first_pos].start
        last = df.iloc[last_pos].start
        final.append({'type': k, 'length': len(glist), 'diff': last - first})
    pos = last_pos + 1

final = pd.DataFrame(final, columns=['type', 'length', 'diff'])
print(final)
Output:
  type  length   diff
0  C>T       2   1000
1  A>G       3  14000
2  C>T       3   2000
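The loop above tracks row positions by hand. As a rough alternative sketch, `itertools.groupby` can walk the `(col1, start)` pairs directly, so no position counter is needed. The helper name `consecutive_runs` is hypothetical and not part of the original answer.

```python
from itertools import groupby

import pandas as pd


def consecutive_runs(types, starts):
    """Hypothetical helper: one row per run of 2+ equal consecutive values."""
    rows = []
    # Group consecutive (type, start) pairs that share the same type
    for value, run in groupby(zip(types, starts), key=lambda pair: pair[0]):
        run_starts = [s for _, s in run]
        if len(run_starts) > 1:
            rows.append({
                'type': value,
                'length': len(run_starts),
                'diff': run_starts[-1] - run_starts[0],
            })
    return pd.DataFrame(rows, columns=['type', 'length', 'diff'])


df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G',
             'C>T', 'C>T', 'C>T'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000,
              2000, 3000, 4000],
})
out = consecutive_runs(df['col1'], df['start'])
print(out)
```

This is still a Python-level loop, so for large frames a groupby-based approach is likely to be faster.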
You can use pandas groupby together with more_itertools:
import pandas as pd
import more_itertools as mit

def f(g):
    # g holds every row with the same col1 value; consecutive index labels
    # mark a run, so this assumes the default 0..n-1 integer index
    tp = g['col1'].iloc[0]
    rows = []
    for group in mit.consecutive_groups(g.index):
        group = list(group)
        if len(group) == 1:
            continue  # skip isolated values
        rows.append({'type': tp, 'length': len(group),
                     'diff': g.loc[group[-1], 'start'] - g.loc[group[0], 'start']})
    return pd.DataFrame(rows, columns=['type', 'length', 'diff'])

df.groupby('col1').apply(f).reset_index(drop=True)
Here is a two-step solution: first create a helper column that labels each run of consecutive identical strings, then use a standard pandas groupby:
import numpy as np

# add a group variable
values = df['col1'].values
# get locations where value changes
change = np.zeros(values.size, dtype=bool)
change[1:] = values[:-1] != values[1:]
df['group'] = change.cumsum()  # summing change points yields the label

# do the aggregation
res = (df
    .groupby('group')
    .agg({'start': lambda x: x.max() - x.min(), 'col1': 'first', 'col2': 'size'})
    .rename(columns={'col1': 'type', 'col2': 'length', 'start': 'diff'})
)
# filter on more than one consecutive value
res = res[res['length'] > 1]
print(res)
        diff type  length
group
1       1000  C>T       2
4      14000  A>G       3
5       2000  C>T       3
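The same change-point idea can give the run boundaries directly, without a groupby. The following is only a sketch under the assumption that the data fits in plain NumPy arrays; the names `is_run_start`, `first`, `last` and `keep` are illustrative. Unlike the answer above, it takes `diff` as last minus first rather than max minus min, which is the same thing whenever `start` increases within a run.

```python
import numpy as np
import pandas as pd

col1 = np.array(['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G',
                 'C>A', 'C>T', 'C>T', 'C>T'])
start = np.array([1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000,
                  10000, 2000, 3000, 4000])

# Mark the first row of every run (row 0 always starts a run)
is_run_start = np.ones(col1.size, dtype=bool)
is_run_start[1:] = col1[1:] != col1[:-1]

first = np.flatnonzero(is_run_start)          # first row of each run
last = np.append(first[1:], col1.size) - 1    # last row of each run
length = last - first + 1

keep = length > 1                             # drop single-row runs
res = pd.DataFrame({
    'type': col1[first[keep]],
    'length': length[keep],
    'diff': start[last[keep]] - start[first[keep]],
})
print(res)
```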