
Custom aggregation function using 2 columns in pandas

I have the following table

event_name | score | date      | flag | 
event_1    | 123   | 12APR2018 |  0   |
event_1    | 34    | 05JUN2019 |  0   |
event_1    | 198   | 08APR2020 |  0   |
event_2    | 3     | 14SEP2019 |  0   |
event_2    | 34    | 22DEC2019 |  1   |
event_2    | 90    | 17FEB2020 |  0   | 
event_3    | 772   | 19MAR2021 |  1   |

And I want to obtain

event_name | sum_score | date_flag_1 | 
event_1    | 355       |             | 
event_2    | 127       | 22DEC2019   | 
event_3    | 772       | 19MAR2021   | 

where sum_score is the sum of the score column for the corresponding event, and date_flag_1 is the first date on which flag = 1 for that event. If flag = 0 for all rows of an event, date_flag_1 should be missing.
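For reference, the sample data above can be reconstructed like this (dates kept as plain strings):

import pandas as pd

df = pd.DataFrame({
    'event_name': ['event_1', 'event_1', 'event_1',
                   'event_2', 'event_2', 'event_2', 'event_3'],
    'score': [123, 34, 198, 3, 34, 90, 772],
    'date': ['12APR2018', '05JUN2019', '08APR2020',
             '14SEP2019', '22DEC2019', '17FEB2020', '19MAR2021'],
    'flag': [0, 0, 0, 0, 1, 0, 1],
})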

I suppose that the code should look something like

df_agg = df.groupby('event_name').agg({'score': 'sum', ['date', 'flag']: my_custom_function})
df_agg.columns = ['event_name', 'sum_score', 'date_flag_1']

However, I am not sure how I should implement my_custom_function, which would be a custom aggregation function that uses two columns instead of one (unlike the other aggregation functions). Please help.

Aggregate twice and concat the results. For the second aggregation, subset to the rows where flag equals 1, then use the built-in GroupBy.first:

import pandas as pd

pd.concat([df.groupby('event_name')['score'].sum(),          # total score per event
           df[df.flag.eq(1)]                                 # keep only rows where flag == 1
             .groupby('event_name')['date'].first()          # first such date per event
             .rename('date_flag_1')],
          axis=1)

#            score date_flag_1
#event_name                   
#event_1       355         NaN
#event_2       127   22DEC2019
#event_3       772   19MAR2021
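If the exact column names from the desired output matter, a small follow-up on the concat result works; here I assign it to a variable res (a name introduced just for this sketch):

res = pd.concat([df.groupby('event_name')['score'].sum(),
                 df[df.flag.eq(1)].groupby('event_name')['date'].first().rename('date_flag_1')],
                axis=1)
res = res.rename(columns={'score': 'sum_score'}).reset_index()

#   event_name  sum_score date_flag_1
# 0    event_1        355         NaN
# 1    event_2        127   22DEC2019
# 2    event_3        772   19MAR2021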

For illustration, this can be done with a single agg call; however, it will be very slow because it requires a lambda, which is evaluated in a slow Python-level loop over the groups (as opposed to the vectorized/cythonized built-in GroupBy operations).

Because .agg only acts on a single Series, the hacky work-around is to create a function that accepts both the Series and the DataFrame. You use the Series index to subset the DataFrame (the index must be non-duplicated for this to work properly), which lets you do aggregations that use multiple columns. This is both overly complicated and slow, so I wouldn't do it.

import numpy as np

def get_first_date(s, df):
    # rows within the group where the flag Series `s` equals 1
    res = df.loc[s[s.eq(1)].index, 'date'].dropna()

    if not res.empty:
        return res.iloc[0]  # first flagged date in the group
    else:
        return np.nan       # no row with flag == 1 in this group

df.groupby('event_name').agg({'score': 'sum', 
                              'flag': lambda x: get_first_date(x, df)})

#            score       flag
#event_name                  
#event_1       355        NaN
#event_2       127  22DEC2019
#event_3       772  19MAR2021
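As before, the flag column in this result actually holds the first flagged date, so renaming gives the requested column names (a purely cosmetic step on top of the code above):

df_agg = df.groupby('event_name').agg({'score': 'sum',
                                       'flag': lambda x: get_first_date(x, df)})
df_agg = df_agg.rename(columns={'score': 'sum_score', 'flag': 'date_flag_1'})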
