Pythonic way to apply different aggregate functions to different columns of a pandas dataframe? And to name the columns efficiently?
In SQL it is very easy to apply different aggregate functions to different columns, e.g.:
select item, sum(a) as [sum of a], avg(b) as [avg of b], min(c) as [min of c]
In pandas, not so much. The solution provided here has since been deprecated:
df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
"unemp": {"mean_unemp": "mean"}})
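The least bad solution I have managed to find, mostly based on other Stack Overflow questions I can no longer find, is something like the toy example at the bottom, where I define a function that computes each aggregate for a group, return the results as a pd.Series indexed by a list of column names, and apply that function to every group.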
If you only have 2 or 3 columns to create, this solution is great.
However, if you have many columns to calculate, naming them becomes fiddly and very error-prone: I have to create a list with the column names, and pass that list as the index of the Series returned by the function.
Now imagine I already have 12 columns and need to add 3 more; there's a chance I get confused and add the corresponding column names in the wrong order.
Compare this with SQL, where you assign the name right after defining the calculation: the difference is night and day.
Is there a better way? E.g. a way to assign the name of the column at the same time I define the calculation?
The focus of the question is specifically on how to name the columns so as to minimise the risk of errors and confusion. There are somewhat similar questions based on now-deprecated pandas functionality, or with answers that name the columns automatically but, to my knowledge, no question that focuses on this very point.
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], data=np.random.rand(300, 4))
df['city'] = np.repeat(['London', 'New York', 'Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function;
    # x is the sub-DataFrame for one group, df is the entire table
    b1 = x['a'].sum()
    b2 = x['a'].sum() / df['a'].sum() if df['a'].sum() != 0 else np.nan
    b3 = x['b'].mean()
    b4 = (x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() > 0 else np.nan
    b5 = x['c'].sum()
    b6 = x['d'].sum()
    # The names must be listed in exactly the same order as the values below
    cols = ['sum of a',
            '% of a',
            'avg of b',
            'weighted avg of a, weighted by b',
            'sum of c',
            'sum of d']
    return pd.Series([b1, b2, b3, b4, b5, b6], index=cols)

out = df.groupby('city').apply(lambda x: func(x, df))
I am not an expert, but what I usually do is use a dictionary like this:
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], data=np.random.rand(300, 4))
df['city'] = np.repeat(['London', 'New York', 'Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function;
    # x is the sub-DataFrame for one group, df is the entire table
    # Each name is assigned on the same line as its calculation
    s_dict = {}
    s_dict['sum of a'] = x['a'].sum()
    s_dict['% of a'] = x['a'].sum() / df['a'].sum() if df['a'].sum() != 0 else np.nan
    s_dict['avg of b'] = x['b'].mean()
    s_dict['weighted avg of a, weighted by b'] = (x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() > 0 else np.nan
    s_dict['sum of c'] = x['c'].sum()
    s_dict['sum of d'] = x['d'].sum()
    return pd.Series(s_dict)

out = df.groupby('city').apply(lambda x: func(x, df))
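Another option that may be worth a look, assuming pandas 0.25 or later, is named aggregation: each keyword passed to agg() is the output column name and its value is an (input column, aggregation) pair, so the name sits right next to the calculation much like SQL's "as". The sketch below is only one possible arrangement, not a drop-in replacement. The helper column a_times_b and the intermediate columns 'sum of b' and 'sum of a*b' are names I made up for illustration, and the division-by-zero guards from the function above are left out for brevity.

```python
import numpy as np
import pandas as pd

# Same toy data as above, but seeded so the result is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((300, 4)), columns=["a", "b", "c", "d"])
df["city"] = np.repeat(["London", "New York", "Buenos Aires"], 100)

# Hypothetical helper column: turns the weighted average into a ratio of two sums.
df["a_times_b"] = df["a"] * df["b"]

# Keys are the output names, values are (input column, aggregation) pairs.
# Unpacking a dict with ** allows names that contain spaces.
named = {
    "sum of a": ("a", "sum"),
    "avg of b": ("b", "mean"),
    "sum of c": ("c", "sum"),
    "sum of d": ("d", "sum"),
    "sum of b": ("b", "sum"),             # intermediate, dropped below
    "sum of a*b": ("a_times_b", "sum"),   # intermediate, dropped below
}
out = df.groupby("city").agg(**named)

# Columns that depend on several inputs or on the whole table are derived
# from the aggregated results (no zero-division guard here).
out["% of a"] = out["sum of a"] / df["a"].sum()
out["weighted avg of a, weighted by b"] = out["sum of a*b"] / out["sum of b"]
out = out.drop(columns=["sum of b", "sum of a*b"])
```

Adding a new simple aggregate is then a single new line in the dictionary, so a name cannot drift away from its calculation. Anything that does not reduce to per-column aggregates would still need the apply() approach shown above.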