python pandas: applying different aggregate functions to different columns
I am trying to understand what the equivalent of this simple SQL statement would be:
select mykey, sum(Field1) as sum_of_field1, avg(Field1) as avg_field1, min(field2) as min_field2
from df
group by mykey
I understand I can pass a dictionary to the agg() function:
f = {'Field1':'sum',
'Field2':['max','mean'],
'Field3':['min','mean','count'],
'Field4':'count'
}
grouped = df.groupby('mykey').agg(f)
However, the resulting column names seem to be chosen by pandas automatically: ('Field1','sum') etc.
Is there a way to pass strings for column names, so that the field is not ('Field1','sum') but something I can choose, like sum_of_field1?
Thanks. I looked at the docs here: http://pandas.pydata.org/pandas-docs/stable/groupby.html but couldn't quite find an answer.
As of pandas 0.25, this is possible with "named aggregation": each keyword argument to .agg becomes an output column name, and its value is a (column, aggfunc) tuple.
In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
....: 'height': [9.1, 6.0, 9.5, 34.0],
....: 'weight': [7.9, 7.5, 9.9, 198.0]})
....:
In [80]: animals
Out[80]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
In [82]: animals.groupby("kind").agg(
....: min_height=('height', 'min'),
....: max_height=('height', 'max'),
....: average_weight=('weight', np.mean),
....: )
....:
Out[82]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
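Applied to the SQL statement in the question, named aggregation might look like the sketch below. The sample data is made up for illustration; only the column names (mykey, Field1, Field2) and the output names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "mykey": ["a", "a", "b", "b"],
    "Field1": [1.0, 3.0, 10.0, 20.0],
    "Field2": [5, 2, 7, 9],
})

# Each keyword is the output column name; each value is (input column, aggfunc).
result = df.groupby("mykey").agg(
    sum_of_field1=("Field1", "sum"),
    avg_field1=("Field1", "mean"),
    min_field2=("Field2", "min"),
)
print(result)
```

The same input column can appear under several keywords, which covers the sum and avg of Field1 in the SQL. This needs pandas 0.25 or later.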
The older approach follows. It was deprecated in pandas 0.20 and removed in 1.0, so it only works on old versions:
You can pass a dictionary of dictionaries to .agg, mapping {column: {name: aggfunc}}, for example:
In [46]: df.head()
Out[46]:
   Year  qtr  realgdp  realcons  realinvs  realgovt  realdpi  cpi_u      M1  \
0  1950    1   1610.5    1058.9     198.1     361.0   1186.1   70.6  110.20
1  1950    2   1658.8    1075.9     220.4     366.4   1178.1   71.4  111.75
2  1950    3   1723.0    1131.0     239.7     359.6   1196.5   73.2  112.95
3  1950    4   1753.9    1097.6     271.8     382.5   1210.0   74.9  113.93
4  1951    1   1773.5    1122.8     242.9     421.9   1207.9   77.3  115.08
   tbilrate  unemp      pop     infl  realint
0      1.12    6.4  149.461   0.0000   0.0000
1      1.17    5.6  150.260   4.5071  -3.3404
2      1.23    4.6  151.064   9.9590  -8.7290
3      1.35    4.2  151.871   9.1834  -7.8301
4      1.40    3.5  152.393  12.6160 -11.2160
In [47]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
"unemp": {"mean_unemp": "mean"}})
Out[47]:
         realgdp                    unemp
        mean_gdp      std_gdp  mean_unemp
qtr
1    4506.439216  2104.195963    5.694118
2    4546.043137  2121.824090    5.686275
3    4580.507843  2132.897955    5.662745
4    4617.592157  2158.132698    5.654902
The result has a MultiIndex in the columns. If you don't want that outer level, you can use .columns.droplevel(0).
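A minimal sketch of dropping the outer level. Because the dict-of-dicts form no longer works in recent pandas, this assumes a dict of lists instead, so the inner level holds function names rather than custom names. The data is hypothetical; only the column names echo the example above.

```python
import pandas as pd

# Hypothetical data; column names echo the example above.
df = pd.DataFrame({
    "qtr": [1, 1, 2, 2],
    "realgdp": [1610.5, 1773.5, 1658.8, 1800.0],
    "unemp": [6.4, 3.5, 5.6, 3.1],
})

# A dict of lists yields two-level columns: (input column, function name).
grouped = df.groupby("qtr").agg({"realgdp": ["mean", "std"], "unemp": ["max"]})

# Drop the outer (input column) level, keeping only the function names.
flat = grouped.copy()
flat.columns = flat.columns.droplevel(0)
print(flat.columns.tolist())
```

If the same function is applied to more than one column, dropping the outer level leaves duplicate column names, so this only suits cases where the inner labels are unique.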
I agree this is a bit frustrating, but I do find that chaining with a rename method serves my purpose. Also, when it gets really complex, I just reset the column names. The columns are a MultiIndex, which is immutable, so you should get comfortable dealing with levels.
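One way to read "reset the column names" is to assign a whole new list to .columns after aggregating. A MultiIndex cannot be edited in place, but the columns attribute as a whole can be replaced. A sketch with made-up data, reusing the names from the question:

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "mykey": ["a", "a", "b", "b"],
    "Field1": [1.0, 3.0, 10.0, 20.0],
    "Field2": [5, 2, 7, 9],
})

grouped = df.groupby("mykey").agg({"Field1": ["sum", "mean"], "Field2": ["min"]})

# Columns come out in dict order, then list order:
# ('Field1', 'sum'), ('Field1', 'mean'), ('Field2', 'min')
grouped.columns = ["sum_of_field1", "avg_field1", "min_field2"]
print(grouped)
```

The new names are matched by position, so the list has to be in the same order as the aggregation spec.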
Based on the pandas documentation:
The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this:
In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'})
....: )
....:
Out[67]:
          foo       bar       baz
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
When one function is used on several columns and you want to name each result differently, this question on dropping the level and joining the different levels with an underscore will help.
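Joining the levels with an underscore can be sketched as follows. The data is hypothetical; the point is that the same function ("min") is applied to two columns, so the function name alone would be ambiguous.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "mykey": ["a", "a", "b", "b"],
    "Field1": [1.0, 3.0, 10.0, 20.0],
    "Field2": [5, 2, 7, 9],
})

grouped = df.groupby("mykey").agg({"Field1": ["sum", "min"], "Field2": ["min", "max"]})

# Iterating over a MultiIndex yields one tuple per column, e.g. ('Field1', 'sum');
# joining each tuple gives a unique flat name such as 'Field1_sum'.
grouped.columns = ["_".join(col) for col in grouped.columns]
print(grouped.columns.tolist())
```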
If you do find the SQL syntax cleaner, there is a library called pandasql that gives you this flexibility.