pandas groupby aggregate customised function with multiple columns
I am trying to use a customised function with groupby in pandas. I find that using apply allows me to do that in the following way:
(An example which calculates a new mean from two groups)
import pandas as pd

def newAvg(x):
    x['cm'] = x['count']*x['mean']
    sCount = x['count'].sum()
    sMean = x['cm'].sum()
    return sMean/sCount

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])
df_gb = df.groupby(['pool']).apply(newAvg)
Is it possible to integrate this into an agg function? Along these lines:
df.groupby(['pool']).agg({'count': sum, ['count', 'mean']: apply(newAvg)})
The agg function works on each column separately, so one possible solution is to create the column cm first with assign, then aggregate with sum, and finally divide the two columns:
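# Select several columns with a list (double brackets); the tuple form ['cm','count'] is rejected by recent pandas versions
df_gb = df.assign(cm=df['count']*df['mean']).groupby('pool')[['cm', 'count']].sum()
print (df_gb)
        cm  count
pool
A     28.0      7
B     77.0      7
out = df_gb.pop('cm') / df_gb.pop('count')
print (out)
pool
A     4.0
B    11.0
dtype: float64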
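The assign, sum and divide steps above can be wrapped in a small reusable function. This is only a sketch: the helper name weighted_mean_by_group and the temporary column name _wx are hypothetical and not part of pandas or of the original answer.

```python
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

def weighted_mean_by_group(frame, key, weight, value):
    """Hypothetical helper: per-group mean of `value` weighted by `weight`."""
    # Compute weight * value once for every row of the whole frame.
    tmp = frame.assign(_wx=frame[weight] * frame[value])
    # Sum the products and the weights within each group.
    sums = tmp.groupby(key)[['_wx', weight]].sum()
    # Weighted mean = sum(weight * value) / sum(weight).
    return sums['_wx'] / sums[weight]

result = weighted_mean_by_group(df, 'pool', 'count', 'mean')
print(result)
```

For pool A this works out as (4*2.5 + 3*6) / (4 + 3) = 28 / 7 = 4.0, and for pool B as (4*9.5 + 3*13) / 7 = 77 / 7 = 11.0.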
A dictionary with agg is used to perform separate calculations for each series. For your problem, I suggest pd.concat:
g = df.groupby('pool')
res = pd.concat([g['count'].sum(), g.apply(newAvg).rename('newAvg')], axis=1)
print(res)
# count newAvg
# pool
# A 7 4.0
# B 7 11.0
This isn't the most efficient solution, as your function newAvg is performing calculations which can be performed on the entire dataframe initially, but it does support arbitrary pre-defined calculations.
IIUC:
df.groupby(['pool']).apply(lambda x : pd.Series({'count':sum(x['count']),'newavg':newAvg(x)}))
Out[58]:
count newavg
pool
A 7.0 4.0
B 7.0 11.0
Use assign with eval:
df.assign(cm=df['count']*df['mean'])\
  .groupby('pool', as_index=False)[['cm', 'count']].sum()\
  .eval('AggCol = cm / count')
Output:
pool cm count AggCol
0 A 28.0 7 4.0
1 B 77.0 7 11.0
If you are calculating a weighted average, you can do it easily using agg and the NumPy np.average function. Just read the Series for the 'mean' column:
import numpy as np
df_gb = df.groupby(['pool']).agg(lambda x: np.average(x['mean'], weights=x['count']))['mean']
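Whether agg passes the lambda a whole sub-DataFrame or a single column at a time may depend on the pandas version, so this is an assumption worth checking in your own environment. The sketch below computes the same weighted average with apply on an explicit column selection, which always receives each group as a DataFrame; it is an alternative, not the approach given in the answer.

```python
import numpy as np
import pandas as pd

data = [['A', 4, 2.5], ['A', 3, 6], ['B', 4, 9.5], ['B', 3, 13]]
df = pd.DataFrame(data, columns=['pool', 'count', 'mean'])

# Select only the columns the calculation needs, then hand each group
# (a two-column DataFrame) to np.average with 'count' as the weights.
weighted = (
    df.groupby('pool')[['count', 'mean']]
      .apply(lambda g: np.average(g['mean'], weights=g['count']))
)
print(weighted)
```

The result is a Series indexed by pool, so no further column selection is needed afterwards.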
You could also do it using your newAvg function, although this will produce warnings:
df_gb2 = df.groupby(['pool']).agg(newAvg)['mean']
If you want to keep using the newAvg function, you can redefine it so that it no longer assigns a new column to the data it receives, which avoids working on copies:
def newAvg(x):
    cm = x['count']*x['mean']
    sCount = x['count'].sum()
    sMean = cm.sum()
    return sMean/sCount
With this modification, you get your expected output:
df_gb2 = df.groupby(['pool']).agg(newAvg)['mean']
print(df_gb2)
# pool
# A 4.0
# B 11.0
# Name: mean, dtype: float64
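To illustrate the difference between the two definitions of newAvg, the sketch below runs both on a plain DataFrame holding the pool A rows, with no groupby involved. The names new_avg_mutating and new_avg_pure are hypothetical labels for the original and the redefined function.

```python
import pandas as pd

def new_avg_mutating(x):
    # Original version: writes a 'cm' column into the frame it was given.
    x['cm'] = x['count'] * x['mean']
    return x['cm'].sum() / x['count'].sum()

def new_avg_pure(x):
    # Redefined version: keeps the product in a local variable instead.
    cm = x['count'] * x['mean']
    return cm.sum() / x['count'].sum()

# The two rows of pool A from the example data.
group_a = pd.DataFrame({'count': [4, 3], 'mean': [2.5, 6.0]})

pure_result = new_avg_pure(group_a)
cols_after_pure = list(group_a.columns)    # input frame left untouched

mut_result = new_avg_mutating(group_a)
cols_after_mut = list(group_a.columns)     # input frame gained a 'cm' column

print(pure_result, cols_after_pure)
print(mut_result, cols_after_mut)
```

Both versions return the same value; only the first one changes its input, which is the kind of assignment that is likely to trigger the warnings when the input is a group taken from a larger dataframe.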