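Python - Group-by multiple columns with .mean() and .agg()
I want to group by three columns and then compute the mean of a fourth, numeric column over all rows that share the same values in the first three columns. I can do this with the following: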
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False)['col4'].mean()
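The problem is that I also want a fifth column that aggregates the strings of all rows grouped together by groupby, and I don't know how to do that on top of the call above. For example: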
df
index col1 col2 col3 col4 col5
0 Week_1 James John 1 when and why?
1 Week_1 James John 3 How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?
4 Week_2 Andrew Simon 1 How far is it?
5 Week_2 Andrew Simon 2 Are you going?
CURRENT(with above function):
index col1 col2 col3 col4
0 Week_1 James John 2
1 Week_2 James John 2
2 Week_2 Mark Jim 3
3 Week_2 Andrew Simon 1.5
DESIRED:
index col1 col2 col3 col4 col5
0 Week_1 James John 2 when and why?, How?
2 Week_2 James John 2 Do you know when?
3 Week_2 Mark Jim 3 What time?
4 Week_2 Andrew Simon 1.5 How far is it?, Are you going?
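I have tried the approaches here and here, but the .mean() call I am using complicates things. Any help would be appreciated. (If possible, I would also like to specify a custom separator for the col5 strings when aggregating.)
You can define an aggregation function for each column: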
df2=df.groupby(['col1','col2','col3'], as_index=False).agg({'col4':'mean', 'col5':','.join})
print (df2)
col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?,How?
1 Week_2 Andrew Simon 1.5 How far is it?,Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?
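As a self-contained sketch of the per-column approach above, the following rebuilds the question's sample frame and joins col5 with a custom separator. The separator value `SEP` is only an illustrative choice, and `sort=False` is an assumption added here so that groups come out in order of first appearance, as in the DESIRED table, rather than sorted by key:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "col1": ["Week_1", "Week_1", "Week_2", "Week_2", "Week_2", "Week_2"],
    "col2": ["James", "James", "James", "Mark", "Andrew", "Andrew"],
    "col3": ["John", "John", "John", "Jim", "Simon", "Simon"],
    "col4": [1, 3, 2, 3, 1, 2],
    "col5": ["when and why?", "How?", "Do you know when?",
             "What time?", "How far is it?", "Are you going?"],
})

SEP = " | "  # illustrative separator, not from the original post

# One aggregation per column: mean for col4, string join for col5.
# sort=False keeps groups in order of first appearance.
df2 = (
    df.groupby(["col1", "col2", "col3"], as_index=False, sort=False)
      .agg({"col4": "mean", "col5": SEP.join})
)
print(df2)
```

Any string works as the separator, since `SEP.join` is just the bound `str.join` method applied to each group's col5 values.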
A general solution is to aggregate numeric columns with mean and all other columns with join:
import numpy as np
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
df2 = df.groupby(['col1', 'col2', 'col3'], as_index=False).agg(f)
print (df2)
col1 col2 col3 col4 col5
0 Week_1 James John 2.0 when and why?, How?
1 Week_2 Andrew Simon 1.5 How far is it?, Are you going?
2 Week_2 James John 2.0 Do you know when?
3 Week_2 Mark Jim 3.0 What time?
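A variant of the general solution, sketched here under a few assumptions: it uses `pandas.api.types.is_numeric_dtype` instead of `np.issubdtype` for the dtype check, and wraps the logic in a hypothetical helper `make_aggregator` so the separator can be chosen per call. Neither name choice comes from the original answer. Note that `is_numeric_dtype` also treats boolean columns as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def make_aggregator(sep=", "):
    """Hypothetical helper: build a per-column aggregation function."""
    def aggregate(series):
        # Numeric columns are averaged, everything else is joined as text.
        if is_numeric_dtype(series.dtype):
            return series.mean()
        return sep.join(series.astype(str))
    return aggregate

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "col1": ["Week_1", "Week_1", "Week_2", "Week_2", "Week_2", "Week_2"],
    "col2": ["James", "James", "James", "Mark", "Andrew", "Andrew"],
    "col3": ["John", "John", "John", "Jim", "Simon", "Simon"],
    "col4": [1, 3, 2, 3, 1, 2],
    "col5": ["when and why?", "How?", "Do you know when?",
             "What time?", "How far is it?", "Are you going?"],
})

df2 = (
    df.groupby(["col1", "col2", "col3"], as_index=False)
      .agg(make_aggregator(sep="; "))
)
print(df2)
```

The closure avoids passing extra keyword arguments through `.agg()`, which keeps the call identical in shape to the lambda version above.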
import pandas as pd
df = pd.DataFrame({
'col1':['a','a','b','b'],
'col2':[1,2,1,1],
'col3':['str1','str2','str3','str4']
})
result = df.groupby(['col1','col2'])['col3'].apply(lambda x:','.join(list(x)))
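The call above returns a Series indexed by (col1, col2). If a flat DataFrame is preferred, one possible sketch is to aggregate with `','.join` directly and then call `reset_index()`; this is a variation on the snippet above, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": [1, 2, 1, 1],
    "col3": ["str1", "str2", "str3", "str4"],
})

# Join the col3 strings within each (col1, col2) group,
# then turn the group keys back into ordinary columns.
result = (
    df.groupby(["col1", "col2"])["col3"]
      .agg(",".join)
      .reset_index()
)
print(result)
```

Only the ("b", 1) group contains more than one row, so it is the only one whose col3 value ends up with a separator in it.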