pandas aggregate function with multiple output columns

I am trying to define an aggregation function with more than one output column, which I would like to use as follows:

df.groupby(by=...).agg(my_aggregation_function_with_multiple_columns)

Any idea how to do it?

I tried things like:

def my_aggregation_function_with_multiple_columns(slice_values):
    return {'col_1': -1,'col_2': 1}

but, as expected, this puts the whole dictionary {'col_1': -1,'col_2': 1} into a single column...

This is not possible with agg, because agg works on each column separately: it processes the first column, then the second, and so on to the end.

The solution is the more flexible apply. To produce multiple output columns, return a Series when the output consists of more than one scalar.

def my_aggregation_function_with_multiple_columns(slice_values):
    return pd.Series([-1, 1], index=['col_1','col_2'])

df.groupby(by=...).apply(my_aggregation_function_with_multiple_columns)
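
To make the pattern above concrete, here is a minimal runnable sketch. The function name summarize_group and the output columns b_sum and c_range are hypothetical, chosen only for illustration; the DataFrame is the same one used in the sample further down. Selecting the columns B and C before calling apply is an assumption made here so that the grouping column is not passed into the function.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 1, 2, 2, 3], B=[4, 5, 6, 7, 2], C=[1, 2, 4, 6, 9]))

def summarize_group(group):
    # group is the sub-DataFrame holding the rows of one value of A.
    # Each entry of the returned Series becomes one output column.
    return pd.Series({
        "b_sum": group["B"].sum(),
        "c_range": group["C"].max() - group["C"].min(),
    })

# One row per group, one column per key of the returned Series.
result = df.groupby("A")[["B", "C"]].apply(summarize_group)
print(result)
```

Because every group returns a Series with the same index, pandas stacks the results into a DataFrame indexed by A.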

Sample:

import pandas as pd

df = pd.DataFrame(dict(A=[1,1,2,2,3], B=[4,5,6,7,2], C=[1,2,4,6,9]))
print (df)

def my_aggregation_function_with_multiple_columns(slice_values):
    #print each group
    #print (slice_values)
    a = slice_values['B'] + slice_values['C'].shift()
    print (type(a))
    return a

df = df.groupby('A').apply(my_aggregation_function_with_multiple_columns)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

print (df)
A   
1  0     NaN
   1     6.0
2  2     NaN
   3    11.0
3  4     NaN
dtype: float64

The question can be interpreted in multiple ways. The following offers a solution for computing more than one output column, giving the possibility to use a different function for each column.

The example uses the same Pandas DataFrame df as the answer above:

import pandas as pd
df = pd.DataFrame(dict(A=[1,1,2,2,3], B=[4,5,6,7,2], C=[1,2,4,6,9]))

For each group in A, the sum of the values in B is computed and put in one column, and the number of values (count) in B is computed and put in another column.

df.groupby(['A'], as_index=False).agg({'B': {'B1':sum, 'B2': "count"}})

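Dictionaries with renaming were deprecated in pandas 0.20 and removed in pandas 1.0, where the code above raises a SpecificationError, so the following form, which passes a list of functions, is preferable:

df.groupby(['A'], as_index=False).agg({'B': ['sum', 'count']})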
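
If the output columns should still carry custom names such as B1 and B2, named aggregation (available since pandas 0.25) is one way to get them. The following is a sketch on the same DataFrame; the variable name named is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 1, 2, 2, 3], B=[4, 5, 6, 7, 2], C=[1, 2, 4, 6, 9]))

# Each keyword is an output column name; the tuple is (input column, function).
named = df.groupby("A", as_index=False).agg(
    B1=("B", "sum"),
    B2=("B", "count"),
)
print(named)
```

The result has flat column names (A, B1, B2) rather than a two-level column index.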

The next example shows how to apply different computations to different columns, here the sum of B and the mean of C:

df.groupby(['A'], as_index=False).agg({'B': 'sum', 'C': 'mean'})
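
The per-column functions do not have to be built-in names. As a hedged sketch, a custom callable can be mixed in; the range computation on C below is a hypothetical example, not part of the original answer. Each callable receives one column of one group as a Series and should return a scalar.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 1, 2, 2, 3], B=[4, 5, 6, 7, 2], C=[1, 2, 4, 6, 9]))

# 'sum' is applied to B; the lambda receives the C values of each group.
mixed = df.groupby("A", as_index=False).agg(
    {"B": "sum", "C": lambda s: s.max() - s.min()}
)
print(mixed)
```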
