
Custom aggregation function using 2 columns in pandas

I have the following table

event_name | score | date      | flag | 
event_1    | 123   | 12APR2018 |  0   |
event_1    | 34    | 05JUN2019 |  0   |
event_1    | 198   | 08APR2020 |  0   |
event_2    | 3     | 14SEP2019 |  0   |
event_2    | 34    | 22DEC2019 |  1   |
event_2    | 90    | 17FEB2020 |  0   | 
event_3    | 772   | 19MAR2021 |  1   |

And I want to obtain

event_name | sum_score | date_flag_1 | 
event_1    | 355       |             | 
event_2    | 127       | 22DEC2019   | 
event_3    | 772       | 19MAR2021   | 

where sum_score is the sum of the score column for the corresponding event, and date_flag_1 is the first date on which flag = 1 for that event. If flag = 0 for all rows of an event, date_flag_1 should be missing.
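For reference, the sample data above can be reconstructed like this (dates kept as plain strings):

import pandas as pd

df = pd.DataFrame({
    'event_name': ['event_1', 'event_1', 'event_1',
                   'event_2', 'event_2', 'event_2', 'event_3'],
    'score': [123, 34, 198, 3, 34, 90, 772],
    'date': ['12APR2018', '05JUN2019', '08APR2020',
             '14SEP2019', '22DEC2019', '17FEB2020', '19MAR2021'],
    'flag': [0, 0, 0, 0, 1, 0, 1],
})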

I suppose that the code should look something like

df_agg = df.groupby('event_name').agg({'score': 'sum', ['date', 'flag']: my_custom_function})
df_agg.columns = ['event_name', 'sum_score', 'date_flag_1']

However, I am not sure how I should implement my_custom_function, which would be a custom aggregation function that uses two columns instead of one (unlike the other aggregation functions). Please help.

Aggregate twice and concat the results. For the second aggregation, subset to the rows where flag equals 1, then use the built-in GroupBy.first:

import pandas as pd

pd.concat([df.groupby('event_name')['score'].sum(),          # total score per event
           df[df.flag.eq(1)]                                 # keep only rows where flag == 1
             .groupby('event_name')['date'].first()          # first such date per event
             .rename('date_flag_1')],
          axis=1)

#            score date_flag_1
#event_name                   
#event_1       355         NaN
#event_2       127   22DEC2019
#event_3       772   19MAR2021
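If the exact column names from the desired output matter, a small follow-up on the concat result works; here I assign it to a variable res (a name introduced just for this sketch):

res = pd.concat([df.groupby('event_name')['score'].sum(),
                 df[df.flag.eq(1)].groupby('event_name')['date'].first().rename('date_flag_1')],
                axis=1)
res = res.rename(columns={'score': 'sum_score'}).reset_index()

#   event_name  sum_score date_flag_1
# 0    event_1        355         NaN
# 1    event_2        127   22DEC2019
# 2    event_3        772   19MAR2021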

For illustration, this can be done with a single agg call; however, it will be very slow because it requires a lambda, which is evaluated in a slow Python-level loop over the groups (as opposed to the vectorized/cythonized built-in GroupBy operations).

Because .agg only acts on a single Series, the hacky work-around is to create a function that accepts both the Series and the DataFrame. You use the Series index to subset the DataFrame (the index must be non-duplicated for this to work properly), which lets you do aggregations that use multiple columns. This is both overly complicated and slow, so I wouldn't do it.

import numpy as np

def get_first_date(s, df):
    # rows within the group where the flag Series `s` equals 1
    res = df.loc[s[s.eq(1)].index, 'date'].dropna()

    if not res.empty:
        return res.iloc[0]  # first flagged date in the group
    else:
        return np.nan       # no row with flag == 1 in this group

df.groupby('event_name').agg({'score': 'sum', 
                              'flag': lambda x: get_first_date(x, df)})

#            score       flag
#event_name                  
#event_1       355        NaN
#event_2       127  22DEC2019
#event_3       772  19MAR2021
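As before, the flag column in this result actually holds the first flagged date, so renaming gives the requested column names (a purely cosmetic step on top of the code above):

df_agg = df.groupby('event_name').agg({'score': 'sum',
                                       'flag': lambda x: get_first_date(x, df)})
df_agg = df_agg.rename(columns={'score': 'sum_score', 'flag': 'date_flag_1'})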
