I have the following table:
event_name | score | date      | flag |
event_1    | 123   | 12APR2018 | 0    |
event_1    | 34    | 05JUN2019 | 0    |
event_1    | 198   | 08APR2020 | 0    |
event_2    | 3     | 14SEP2019 | 0    |
event_2    | 34    | 22DEC2019 | 1    |
event_2    | 90    | 17FEB2020 | 0    |
event_3    | 772   | 19MAR2021 | 1    |
And I want to obtain:
event_name | sum_score | date_flag_1 |
event_1    | 355       |             |
event_2    | 127       | 22DEC2019   |
event_3    | 772       | 19MAR2021   |
where sum_score is the sum of column score for the corresponding event, and date_flag_1 is the first date when flag = 1 for that event. If flag = 0 for all the rows of the current event, date_flag_1 should be missing.
I suppose that the code should look something like
df_agg = df.groupby('event_name').agg({'score': 'sum', ['date', 'flag']: my_custom_function})
df_agg.columns = ['event_name', 'sum_score', 'date_flag_1']
However, I am not sure how I should implement my_custom_function, which would be a custom aggregation function that uses two columns instead of one (unlike the other aggregation functions). Please help.
Aggregate twice and concat the results. For the second aggregation, subset the rows where flag equals 1, then use the built-in GroupBy.first.
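To make the answer copy-pastable, here is the sample frame from the question (values taken from the table above):

import pandas as pd

df = pd.DataFrame({
    'event_name': ['event_1', 'event_1', 'event_1',
                   'event_2', 'event_2', 'event_2', 'event_3'],
    'score': [123, 34, 198, 3, 34, 90, 772],
    'date': ['12APR2018', '05JUN2019', '08APR2020',
             '14SEP2019', '22DEC2019', '17FEB2020', '19MAR2021'],
    'flag': [0, 0, 0, 0, 1, 0, 1],
})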
# First aggregation: total score per event.
# Second: subset to the flag == 1 rows, then take the first date per event.
pd.concat([df.groupby('event_name')['score'].sum().rename('sum_score'),
           df[df.flag.eq(1)].groupby('event_name')['date'].first().rename('date_flag_1')],
          axis=1)

#            sum_score date_flag_1
#event_name
#event_1           355         NaN
#event_2           127   22DEC2019
#event_3           772   19MAR2021
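If you are on pandas 0.25+, a single groupby can also stay fully vectorized: mask the date column first, then use named aggregation. This is a sketch assuming the rows are already sorted by date (as in the sample), since 'first' takes the first non-missing value in group order:

# Blank out `date` wherever flag != 1, then aggregate once;
# GroupBy 'first' skips NaN, so events with no flag == 1 stay missing.
(df.assign(date_flag_1=df['date'].where(df['flag'].eq(1)))
   .groupby('event_name')
   .agg(sum_score=('score', 'sum'),
        date_flag_1=('date_flag_1', 'first')))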
For illustration, this can be done with a single agg call; however, it will be very slow because it requires a lambda, which is evaluated in a slow Python loop over the groups (as opposed to the vectorized/cythonized built-in GroupBy operations).
Because .agg only acts on a single Series, the hacky work-around is to create a function that accepts both the Series and the DataFrame. You use the Series index to subset the DataFrame (the index must be free of duplicates for this to work properly), which then lets you do aggregations that use multiple columns. This is both overly complicated and slow, so I wouldn't do it.
import numpy as np

def get_first_date(s, df):
    # rows within the group where `s == 1`
    res = df.loc[s[s.eq(1)].index, 'date'].dropna()
    if not res.empty:
        return res.iloc[0]
    else:
        return np.nan

df.groupby('event_name').agg({'score': 'sum',
                              'flag': lambda x: get_first_date(x, df)})

#            score       flag
#event_name
#event_1      355        NaN
#event_2      127  22DEC2019
#event_3      772  19MAR2021
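If you genuinely need an aggregation that sees several columns at once, GroupBy.apply is the cleaner (though still slow, loop-over-groups) escape hatch, because it hands each group to your function as a full sub-DataFrame. A minimal sketch, with summarize as a hypothetical helper name:

import numpy as np
import pandas as pd

def summarize(g):
    # `g` is the sub-DataFrame for one event, so every column is available
    flagged = g.loc[g['flag'].eq(1), 'date']
    return pd.Series({'sum_score': g['score'].sum(),
                      'date_flag_1': flagged.iloc[0] if not flagged.empty else np.nan})

df.groupby('event_name').apply(summarize)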