Pandas dataframe: how can I group by multiple columns, apply sum to a specific column, and add a new count column?
Given a dataframe df1 as follows:
Col1 Col2 Col3 Col4 Col5
-------------------------------------
A 1 AA 10 Test1
A 1 AA 5 Test2
A 2 AB 30 Test3
B 4 FF 10 Test4
C 1 HH 4 Test7
C 3 GG 6 Test8
C 3 GG 7 Test9
D 1 AA 4 Test5
D 3 FF 6 Test6
I want to group by Col1, Col2 and Col3 and:
Add a new column Count: the size of each group
Add a new column Col4_sum: the sum of Col4 within each group
The output needs to be:
Col1 Col2 Col3 Count Col4_sum
----------------------------------------
A 1 AA 2 15
A 2 AB 1 30
B 4 FF 1 10
C 1 HH 1 4
C 3 GG 2 13
D 1 AA 1 4
D 3 FF 1 6
I tried to use
df1.groupby(['Col1','Col2','Col3']).size()
but I get only the Count column.
Use GroupBy.agg with a list of (new column name, aggregate function) tuples to set the aggregate functions and the new column names in one step:
df = (df1.groupby(['Col1','Col2','Col3'])['Col4']
.agg([('Count','size'), ('Col4_sum','sum')])
.reset_index())
print (df)
Col1 Col2 Col3 Count Col4_sum
0 A 1 AA 2 15
1 A 2 AB 1 30
2 B 4 FF 1 10
3 C 1 HH 1 4
4 C 3 GG 2 13
5 D 1 AA 1 4
6 D 3 FF 1 6
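For anyone who wants to run this directly, here is a minimal self-contained sketch. The construction of df1 is an assumption: it is typed in by hand from the table in the question, and the name result is only illustrative.

```python
import pandas as pd

# Sample data copied by hand from the question (assumed to match df1).
df1 = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'D', 'D'],
    'Col2': [1, 1, 2, 4, 1, 3, 3, 1, 3],
    'Col3': ['AA', 'AA', 'AB', 'FF', 'HH', 'GG', 'GG', 'AA', 'FF'],
    'Col4': [10, 5, 30, 10, 4, 6, 7, 4, 6],
    'Col5': ['Test1', 'Test2', 'Test3', 'Test4', 'Test7',
             'Test8', 'Test9', 'Test5', 'Test6'],
})

# Each tuple is (new column name, aggregation applied to Col4).
result = (df1.groupby(['Col1', 'Col2', 'Col3'])['Col4']
             .agg([('Count', 'size'), ('Col4_sum', 'sum')])
             .reset_index())
print(result)
```

Selecting ['Col4'] before agg means both aggregations run on that single column, which is why the tuples only need a name and a function.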
In pandas 0.25+ it is possible to use named aggregation:
df = (df1.groupby(['Col1','Col2','Col3'])
.agg(Count=('Col5', 'size'), Col4_sum=('Col4', 'sum'))
.reset_index())
print (df)
Col1 Col2 Col3 Count Col4_sum
0 A 1 AA 2 15
1 A 2 AB 1 30
2 B 4 FF 1 10
3 C 1 HH 1 4
4 C 3 GG 2 13
5 D 1 AA 1 4
6 D 3 FF 1 6
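One detail worth keeping in mind with named aggregation is the difference between 'size' and 'count': 'size' counts every row in the group, while 'count' counts only non-missing values in the chosen column. The sketch below uses a hypothetical toy frame (not the question's data) with one NaN to show the difference; the names toy, out and NonNull are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in Col4.
toy = pd.DataFrame({
    'Col1': ['A', 'A', 'B'],
    'Col4': [10.0, np.nan, 7.0],
})

out = (toy.groupby('Col1')
          .agg(Count=('Col4', 'size'),      # rows per group, NaN included
               NonNull=('Col4', 'count'),   # non-missing values only
               Col4_sum=('Col4', 'sum'))    # sum skips NaN by default
          .reset_index())
print(out)
```

For the data in the question there are no missing values, so either function gives the same Count.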
You can use a dict of column names and aggregation functions. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html
>>> df = pd.DataFrame([[1, 2, 3],
... [4, 5, 6],
... [7, 8, 9],
... [np.nan, np.nan, np.nan]],
... columns=['A', 'B', 'C'])
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
# A B
# max NaN 8.0
# min 1.0 2.0
# sum 12.0 NaN
Another solution that's a bit more verbose and hasn't been mentioned is to use the assign function together with transform. Note that transform keeps every original row, so the group values are repeated on each row of the group:
df = (df1.assign(Count=df1.groupby(['Col1','Col2','Col3']).Col4.transform('size'))
         .assign(Col4_sum=df1.groupby(['Col1','Col2','Col3']).Col4.transform('sum'))
         .reset_index())
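Because transform returns one value per original row, the assign approach needs an extra step to reach the one-row-per-group layout asked for in the question. A possible sketch follows, assuming df1 is built by hand from the question's table; the names keys, with_stats and collapsed are illustrative only.

```python
import pandas as pd

# Sample data copied by hand from the question (assumed to match df1).
df1 = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'D', 'D'],
    'Col2': [1, 1, 2, 4, 1, 3, 3, 1, 3],
    'Col3': ['AA', 'AA', 'AB', 'FF', 'HH', 'GG', 'GG', 'AA', 'FF'],
    'Col4': [10, 5, 30, 10, 4, 6, 7, 4, 6],
    'Col5': ['Test1', 'Test2', 'Test3', 'Test4', 'Test7',
             'Test8', 'Test9', 'Test5', 'Test6'],
})

keys = ['Col1', 'Col2', 'Col3']
grouped = df1.groupby(keys)['Col4']

# transform returns a Series aligned with df1 (same length, same index).
with_stats = df1.assign(Count=grouped.transform('size'),
                        Col4_sum=grouped.transform('sum'))

# Keep only the requested columns and one row per group.
collapsed = (with_stats[keys + ['Count', 'Col4_sum']]
             .drop_duplicates()
             .reset_index(drop=True))
print(collapsed)
```

The intermediate with_stats frame is useful on its own when the per-group values are wanted next to the original rows, for example to compute each row's share of its group total.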
This computes the Col4 sum for each group (it does not add the count column):
df2 = df1.groupby(['Col1','Col2','Col3'])['Col4'].agg('sum')
With the agg function and a dictionary, you can customise your output like so:
df1.groupby(['Col1','Col2','Col3']).agg({'Col3': ['count'], 'Col4': ['count','sum']})
This should return one row per (Col1, Col2, Col3) group, with the count for Col3 and the count and sum for Col4.
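A dict with lists of functions produces two-level column labels such as ('Col4', 'count'), so a renaming step is needed to get the flat Count and Col4_sum names from the question. A minimal sketch, assuming df1 is built by hand from the question's table and aggregating only Col4; the name summary is illustrative.

```python
import pandas as pd

# Sample data copied by hand from the question (assumed to match df1).
df1 = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'D', 'D'],
    'Col2': [1, 1, 2, 4, 1, 3, 3, 1, 3],
    'Col3': ['AA', 'AA', 'AB', 'FF', 'HH', 'GG', 'GG', 'AA', 'FF'],
    'Col4': [10, 5, 30, 10, 4, 6, 7, 4, 6],
})

summary = df1.groupby(['Col1', 'Col2', 'Col3']).agg({'Col4': ['count', 'sum']})

# Columns are a two-level MultiIndex here: ('Col4', 'count'), ('Col4', 'sum').
# Replace them with flat names, then move the group keys back into columns.
summary.columns = ['Count', 'Col4_sum']
summary = summary.reset_index()
print(summary)
```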
You can use the function pivot_table:
df = pd.pivot_table(df1, index=['Col1', 'Col2', 'Col3'], values='Col4', aggfunc=['count', 'sum']).reset_index()
df.columns = ['Col1', 'Col2', 'Col3', 'Count', 'Col4_sum']
Output:
Col1 Col2 Col3 Count Col4_sum
0 A 1 AA 2 15
1 A 2 AB 1 30
2 B 4 FF 1 10
3 C 1 HH 1 4
4 C 3 GG 2 13
5 D 1 AA 1 4
6 D 3 FF 1 6
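As a quick sanity check, the pivot_table result can be compared with the groupby result on the same data. This is only a sketch: df1 is typed in by hand from the question's table, and the names via_pivot and via_groupby are illustrative.

```python
import pandas as pd

# Sample data copied by hand from the question (assumed to match df1).
df1 = pd.DataFrame({
    'Col1': ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'D', 'D'],
    'Col2': [1, 1, 2, 4, 1, 3, 3, 1, 3],
    'Col3': ['AA', 'AA', 'AB', 'FF', 'HH', 'GG', 'GG', 'AA', 'FF'],
    'Col4': [10, 5, 30, 10, 4, 6, 7, 4, 6],
})

keys = ['Col1', 'Col2', 'Col3']

# pivot_table route: a list of aggfuncs gives two-level columns, flattened below.
via_pivot = pd.pivot_table(df1, index=keys, values='Col4',
                           aggfunc=['count', 'sum']).reset_index()
via_pivot.columns = keys + ['Count', 'Col4_sum']

# groupby route with named aggregation (pandas 0.25+).
via_groupby = (df1.groupby(keys)
                  .agg(Count=('Col4', 'size'), Col4_sum=('Col4', 'sum'))
                  .reset_index())

# Compare the values column by column.
same = all(via_pivot[c].tolist() == via_groupby[c].tolist()
           for c in via_pivot.columns)
print(same)
```

Note that pivot_table's 'count' ignores missing values, so the two routes could differ on Count if Col4 contained NaN.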