Pandas: Drop and count consecutive duplicates with condition
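I want to drop consecutive duplicates in column val, and count them, but only where val equals 1.
For each such run, start should come from the first row of the run and end from its last row.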
import pandas as pd

df = pd.DataFrame()
df['start'] = [1, 2, 3, 4, 5, 6, 18, 30, 31]
df['end'] = [2, 3, 4, 5, 6, 18, 30, 31, 32]
df['val'] = [1, 1, 1, 1, 1, 12, 12, 1, 1]
df
   start  end  val
0      1    2    1
1      2    3    1
2      3    4    1
3      4    5    1
4      5    6    1
5      6   18   12
6     18   30   12
7     30   31    1
8     31   32    1
Expected result:
   start  end  val
0      1    6    5
1      6   18   12
2     18   30   12
3     30   32    2
I tried this, which only drops the interior rows of each run of 1s:
df[~((df.val==1) & (df.val == df.val.shift(1)) & (df.val == df.val.shift(-1)))]
   start  end  val
0      1    2    1
4      5    6    1
5      6   18   12
6     18   30   12
7     30   31    1
8     31   32    1
But I don't know how to get from here to my expected result. Any suggestions?
Use:
# mask by condition
m = df.val==1
# consecutive groups
g = m.ne(m.shift()).cumsum()
# filter by condition and aggregate per group
df1 = df.groupby(g[m]).agg({'start':'first', 'end':'last', 'val':'sum'})
# concat together; index the untouched rows by g so sort_index restores the order
df = pd.concat([df1, df.set_index(g)[~m.values]]).sort_index().reset_index(drop=True)
print(df)
   start  end  val
0      1    6    5
1      6   18   12
2     18   30   12
3     30   32    2
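To see why this works, it can help to look at the intermediate mask and group labels. The following is only a sketch that rebuilds the sample frame from the question; the names `sample`, `runs`, `rest` and `result` are illustrative, and `kind="mergesort"` is an addition here (not part of the answer) so that rows sharing a group label keep their original order when sorted.

```python
import pandas as pd

sample = pd.DataFrame({
    "start": [1, 2, 3, 4, 5, 6, 18, 30, 31],
    "end":   [2, 3, 4, 5, 6, 18, 30, 31, 32],
    "val":   [1, 1, 1, 1, 1, 12, 12, 1, 1],
})

m = sample["val"].eq(1)        # True where val == 1
g = m.ne(m.shift()).cumsum()   # label increases each time the mask flips
print(g.tolist())              # [1, 1, 1, 1, 1, 2, 2, 3, 3]

# Collapse each run of 1s into one row; summing val counts the run length.
runs = sample[m].groupby(g[m]).agg({"start": "first", "end": "last", "val": "sum"})

# Rows where val != 1 are kept as they are, labelled with their group number.
rest = sample[~m].copy()
rest.index = g[~m].to_numpy()

# A stable sort on the group labels restores the original row order.
result = (
    pd.concat([runs, rest])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```

Because `g` only changes when the mask flips, the two rows with val 12 share label 2; they stay separate rows only because they are never aggregated.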
@jezrael's solution is perfect, but here is a slightly different approach:
df['aux'] = (df['val'] != df['val'].shift()).cumsum()
df.loc[df['val'] == 1, 'end'] = df[df['val'] == 1].groupby('aux')['end'].transform('last')
df.loc[df['val'] == 1, 'val'] = df.groupby('aux')['val'].transform('sum')
df = df.drop_duplicates(subset=df.columns.difference(['start']), keep='first')
df = df.drop(columns=['aux'])
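Two details of this variant are easy to miss: `drop_duplicates` keeps the original row labels, and the rows with val 12 survive only because their end values differ. A minimal sketch on the sample data from the question (the names `sample`, `is_one`, `deduped` and `result` are illustrative, and the final `reset_index` is an optional addition for a clean 0..n-1 index):

```python
import pandas as pd

sample = pd.DataFrame({
    "start": [1, 2, 3, 4, 5, 6, 18, 30, 31],
    "end":   [2, 3, 4, 5, 6, 18, 30, 31, 32],
    "val":   [1, 1, 1, 1, 1, 12, 12, 1, 1],
})

# Label runs of equal consecutive values.
sample["aux"] = sample["val"].ne(sample["val"].shift()).cumsum()

is_one = sample["val"].eq(1)
ones = sample[is_one].groupby("aux")

# Inside each run of 1s, broadcast the last end and the run length to every row.
sample.loc[is_one, "end"] = ones["end"].transform("last")
sample.loc[is_one, "val"] = ones["val"].transform("sum")

# Rows of a run are now identical except for start, so keep the first of each.
deduped = sample.drop_duplicates(subset=["aux", "end", "val"], keep="first")
print(deduped.index.tolist())   # [0, 5, 6, 7]

result = deduped.drop(columns=["aux"]).reset_index(drop=True)
print(result)
```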
You can also build the grouping key for groupby from a mask combined with a shift:
m = (df.val.ne(1) | df.val.ne(df.val.shift())).cumsum()
# 'sum' turns each run of 1s into its length; every other row is its own group and keeps its value
df = df.groupby(m).agg({'start': 'first', 'end': 'last', 'val': 'sum'})
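The grouping key here starts a new group at every row where val is not 1, and at the first row of each run of 1s. The sketch below prints that key for the sample data from the question. It assumes val should end up as the run length, as in the expected result, so it aggregates val with 'sum'; the names `sample`, `key` and `result` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "start": [1, 2, 3, 4, 5, 6, 18, 30, 31],
    "end":   [2, 3, 4, 5, 6, 18, 30, 31, 32],
    "val":   [1, 1, 1, 1, 1, 12, 12, 1, 1],
})

val = sample["val"]

# True at every row that opens a new group: any non-1 row, or a change in value.
starts_group = val.ne(1) | val.ne(val.shift())
key = starts_group.cumsum()
print(key.tolist())   # [1, 1, 1, 1, 1, 2, 3, 4, 4]

result = (
    sample.groupby(key)
    .agg({"start": "first", "end": "last", "val": "sum"})
    .reset_index(drop=True)   # drop=True: the group index is also named "val"
)
print(result)
```

Unlike a key built from the mask alone, each val 12 row gets its own label here, so a single groupby is enough and no concat step is needed.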