In SQL it is very easy to apply different aggregate functions to different columns, e.g.:
select item, sum(a) as [sum of a], avg(b) as [avg of b], min(c) as [min of c]
In pandas, not so much. The solution provided here became deprecated:
df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                       "unemp": {"mean_unemp": "mean"}})
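The least worst solution I have managed to find, mostly based on other Stack Overflow questions I can no longer find, is something like the toy example at the bottom, where I group the data, apply a function that computes every value for the group, and return the values as a pd.Series whose index holds the column names.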
If you only have 2 or 3 columns to create, this solution is great.
However, if you have many columns to calculate, naming them becomes fiddly and very error-prone: I have to create a list with the column names, and pass that list as the index of the Series returned by the function.
Now imagine I already have 12 columns and need to add 3 more; it is easy to get confused and add the corresponding column names in the wrong order.
Compare this with SQL, where you assign the name right after defining the calculation - the difference is night and day.
Is there a better way? E.g. a way to assign the name of the column at the same time as I define the calculation?
The focus of the question is specifically on how to name the columns so as to minimise the risk of errors and confusion. There are somewhat similar questions based on now-deprecated pandas functionality, or with answers that name the columns automatically but, to my knowledge, no question that focuses on this very point.
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], data=np.random.rand(300, 4))
df['city'] = np.repeat(['London', 'New York', 'Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function;
    # x is the sub-DataFrame for one group (one city), df is the entire table
    b1 = x['a'].sum()
    b2 = x['a'].sum() / df['a'].sum() if df['a'].sum() != 0 else np.nan
    b3 = x['b'].mean()
    b4 = (x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() > 0 else np.nan
    b5 = x['c'].sum()
    b6 = x['d'].sum()

    # the names must be listed in exactly the same order as the values below
    cols = ['sum of a',
            '% of a',
            'avg of b',
            'weighted avg of a, weighted by b',
            'sum of c',
            'sum of d']
    return pd.Series([b1, b2, b3, b4, b5, b6], index=cols)

out = df.groupby('city').apply(lambda x: func(x, df))
I am not an expert but what I usually do is use a dictionary like this:
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['a', 'b', 'c', 'd'], data=np.random.rand(300, 4))
df['city'] = np.repeat(['London', 'New York', 'Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function;
    # x is the sub-DataFrame for one group (one city), df is the entire table
    s_dict = {}
    # each column name is assigned on the same line as its calculation
    s_dict['sum of a'] = x['a'].sum()
    s_dict['% of a'] = x['a'].sum() / df['a'].sum() if df['a'].sum() != 0 else np.nan
    s_dict['avg of b'] = x['b'].mean()
    s_dict['weighted avg of a, weighted by b'] = (x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() > 0 else np.nan
    s_dict['sum of c'] = x['c'].sum()
    s_dict['sum of d'] = x['d'].sum()
    return pd.Series(s_dict)

out = df.groupby('city').apply(lambda x: func(x, df))
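For the columns that aggregate a single input column, another option may be named aggregation (available in pandas 0.25 and later), which also keeps each name next to its calculation. The following is only a sketch on the same toy data: the helper column `a_times_b` and the intermediate columns `sum of b` and `sum of a*b` are names made up for this illustration, and the zero-division guards from the function above are left out for brevity.

```python
import numpy as np
import pandas as pd

# Same toy data as above, with a fixed seed so the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((300, 4)), columns=['a', 'b', 'c', 'd'])
df['city'] = np.repeat(['London', 'New York', 'Buenos Aires'], 100)

# Hypothetical helper column, so the weighted average can be built from plain sums.
work = df.assign(a_times_b=df['a'] * df['b'])

# Each output name sits next to its (input column, function) pair.
# The names contain spaces, so they are passed by unpacking a dict.
out = work.groupby('city').agg(**{
    'sum of a': ('a', 'sum'),
    'avg of b': ('b', 'mean'),
    'sum of c': ('c', 'sum'),
    'sum of d': ('d', 'sum'),
    'sum of b': ('b', 'sum'),            # intermediate, dropped below
    'sum of a*b': ('a_times_b', 'sum'),  # intermediate, dropped below
})

# Columns that need more than one input column, or the grand total,
# are derived from the aggregated columns afterwards.
out['% of a'] = out['sum of a'] / df['a'].sum()
out['weighted avg of a, weighted by b'] = out['sum of a*b'] / out['sum of b']
out = out.drop(columns=['sum of b', 'sum of a*b'])
```

The trade-off is that multi-column calculations have to be split into an aggregation step and a derivation step, so the dictionary approach above may still read more naturally when most columns are of that kind.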