How to groupby multiple columns and aggregate data in pandas
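I have a pandas DataFrame with several columns (words, start time, stop time, speaker). I want to combine all values in the word column for as long as the values in the data column and the meta_data column do not change. I also want to keep the start value of the first word and the stop value of the last word in each run. I currently have: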
      word  start   stop  data  meta_data
0      but   2.72   2.85     2          9
1   that's   2.85   3.09     2          9
2  alright   3.09   3.47     2          1
3    we'll   8.43   8.69     1          4
4     have   8.69   8.97     1          4
5       to   8.97   9.07     1          4
6     okay   9.19  10.01     2          2
7     sure  10.02  11.01     2          1
8    what?  11.02  12.00     1          4
However, I want to turn it into:
            word  start   stop  data  meta_data
0     but that's   2.72   3.09     2          9
1        alright   3.09   3.47     2          1
2  we'll have to   8.43   9.07     1          4
3           okay   9.19  10.01     2          2
4           sure  10.02  11.01     2          1
5          what?  11.02  12.00     1          4
This requires creating a helper key; we then use shift + cumsum to create the group key:
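# Helper key: one (data, meta_data) tuple per row
df['Key'] = df[['data', 'meta_data']].apply(tuple, axis=1)
d = {'word': ' '.join, 'start': 'min', 'stop': 'max', 'data': 'first', 'meta_data': 'first'}
# A new group starts wherever the key differs from the previous row
df.groupby(df.Key.ne(df.Key.shift()).cumsum()).agg(d).reset_index(drop=True)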
Out[171]:
            word  start   stop  data  meta_data
0     but that's   2.72   3.09     2          9
1        alright   3.09   3.47     2          1
2  we'll have to   8.43   9.07     1          4
3           okay   9.19  10.01     2          2
4           sure  10.02  11.01     2          1
5          what?  11.02  12.00     1          4
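To make the shift + cumsum step easier to follow, here is a minimal self-contained sketch that rebuilds the sample frame from the question and prints the intermediate group ids. The variable names `key`, `changed` and `group_id` are illustrative only and are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "word": ["but", "that's", "alright", "we'll", "have", "to", "okay", "sure", "what?"],
    "start": [2.72, 2.85, 3.09, 8.43, 8.69, 8.97, 9.19, 10.02, 11.02],
    "stop": [2.85, 3.09, 3.47, 8.69, 8.97, 9.07, 10.01, 11.01, 12.00],
    "data": [2, 2, 2, 1, 1, 1, 2, 2, 1],
    "meta_data": [9, 9, 1, 4, 4, 4, 2, 1, 4],
})

# One hashable key per row: the (data, meta_data) pair.
key = df[["data", "meta_data"]].apply(tuple, axis=1)

# True wherever the key differs from the previous row; the first row is
# always True because shift() leaves a missing value there.
changed = key.ne(key.shift())

# The running sum of the change flags gives each consecutive run its own id.
group_id = changed.cumsum()
print(group_id.tolist())  # [1, 1, 2, 3, 3, 3, 4, 5, 6]
```

Rows that share a group id are the ones that get joined by the `agg` call, which is why "sure" and the earlier "alright" stay separate even though both have data 2 and meta_data 1.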
Here, using a bit of arithmetic + GroupBy.agg:
s = df['data'] + df['meta_data']
groups = s.ne(s.shift()).cumsum()
new_df = (df.groupby(groups)
            .agg({'word': ' '.join, 'start': 'min',
                  'stop': 'max', 'data': 'first',
                  'meta_data': 'first'}))
print(new_df)
            word  start   stop  data  meta_data
1     but that's   2.72   3.09     2          9
2        alright   3.09   3.47     2          1
3  we'll have to   8.43   9.07     1          4
4           okay   9.19  10.01     2          2
5           sure  10.02  11.01     2          1
6          what?  11.02  12.00     1          4
If you think the sum could come out the same for two different consecutive groups, you can use a more involved function with decimal constants:
p=(df['data']+0.1723).pow(df['meta_data']+2.017)
groups=p.ne(p.shift()).cumsum()
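Another option, not taken from the answers above, is to skip the numeric encoding and compare both columns with the previous row directly. The sketch below uses a hypothetical toy frame (`toy`) chosen so that two different consecutive pairs share the same sum, and shows how the two approaches differ on it.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 differ in both columns but share
# the same sum (1 + 4 == 2 + 3), so a sum-based key cannot tell them apart.
toy = pd.DataFrame({"data": [1, 2, 2], "meta_data": [4, 3, 3]})

# Sum-based grouping, as in the answer above.
s = toy["data"] + toy["meta_data"]
sum_groups = s.ne(s.shift()).cumsum()

# Direct comparison: a new group starts when either column changes
# relative to the previous row.
cols = toy[["data", "meta_data"]]
changed = cols.ne(cols.shift()).any(axis=1)
direct_groups = changed.cumsum()

print(sum_groups.tolist())     # [1, 1, 1]
print(direct_groups.tolist())  # [1, 2, 2]
```

`direct_groups` can be passed to `df.groupby(...)` in the same way as `groups`. One caveat: missing values compare as unequal to each other, so rows with NaN in either column would each start a new group.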