简体   繁体   中英

Pythonic way to apply different aggregate functions to different columns of a pandas dataframe? And to name the columns efficiently?

My issue

In SQL it is very easy to apply different aggregate functions to different columns, eg :

select item, sum(a) as [sum of a], avg(b) as [avg of b], min(c) as [min of c]

In pandas, not so much. The solution provided here became deprecated:

df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                                "unemp": {"mean_unemp": "mean"}})

My solution

The least worst solution I have managed to find, mostly based on other stack overflow questions I can no longer find, is something like the toy example at the bottom, where I:

  • define a function with all the calculations I need
  • calculate each column separately, then put them together in a dataframe
  • apply the function as a lambda function:

What I would like to improve: naming columns

If you only have 2 or 3 columns to create, this solution is great.

However, if you have many columns to calculate, naming them becomes fiddly and very error-prone: I have to create a list with the column names, and pass that list as the index of the dataframe created by the function.

Now imagine I already have 12 columns and need to add 3 more; there's a chance I may make some confusion and add the corresponding column names in the wrong order.

Compare this with SQL, where you assign the name right after defining the calculation - the difference is night and day.

Is there a better way? Eg a way to assign the name of the column at the same time I define the calculation ?

Why this is not a duplicate question

The focus of the question is specifically on how to name the columns so as to minimise the risk of errors and confusion. There are somewhat similar questions based on now deprecated functionalities of pandas, or with answers which provide an automatic naming of the columns but, to my knowledge, no question which focuses on this very point.

Toy example

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function; x is the row, df is the entire table    
    b1 = x['a'].sum()
    b2 = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    b3 = x['b'].mean()
    
    b4 = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    b5 = x['c'].sum()
    b6 = x['d'].sum()
    
    
    cols = ['sum of a',
            '% of a',
            'avg of b',
            'weighted avg of a, weighted by b', 
            'sum of c',
            'sum of d']
    

    return pd.Series( [b1, b2, b3, b4, b5, b6] , index = cols ) 

out = df.groupby('city').apply(lambda x: func(x,df))

I am not an expert but what I usually do is use a dictionary like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns =['a','b','c','d'], data = np.random.rand(300,4))
df['city'] = np.repeat(['London','New York','Buenos Aires'], 100)

def func(x, df):
    # func() gets called within a lambda function; x is the row, df is the entire table   
    s_dict = {}

    s_dict['sum of a'] = x['a'].sum()
    s_dict['% of a'] = x['a'].sum() / df['a'].sum() if df['a'].sum() !=0 else np.nan
    
    s_dict['avg of b'] = x['b'].mean()
    
    s_dict['weighted avg of a, weighted by b'] = ( x['a'] * x['b']).sum() / x['b'].sum() if x['b'].sum() >0 else np.nan
    
    s_dict['sum of c'] = x['c'].sum()
    s_dict['sum of d'] = x['d'].sum()
    
    return pd.Series( s_dict  ) 

out = df.groupby('city').apply(lambda x: func(x,df))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM