Pandas groupby aggregate passing group name to aggregate

In a common usage pattern, I need to aggregate a DataFrame using a custom aggregate function. In this special case, the aggregate function needs to know the current group in order to correctly perform the aggregation.

A function passed to DataFrameGroupBy.aggregate() is called for each group and for each column, receiving a Series with the elements of the current group and column. The only way I found to get the group name from inside the aggregate function is to add the grouping column to the index and then extract the value with x.index.get_level_values('power')[0]. Here is an example:

def _tail_mean_user_th(x):
    power = x.index.get_level_values('power')[0]
    th = th_dict[power]  # this value changes with the group
    return x.loc[x > th].mean() - th

mbsize_df = (bursts_sel.set_index('power', append=True).groupby('power')
             .agg({'nt': _tail_mean_user_th}))

It seems to me that it is a pretty common occurrence that the aggregate function needs to know the current group. Is there a more straightforward pattern in this situation?


EDIT: The solution I accepted below is to use apply instead of agg on the GroupBy object. The difference between the two is that agg calls the function for each group and each column separately, while apply calls the function once per group (all columns at once). A subtle consequence is that agg passes a Series for the current group and column whose name attribute equals the original column name, whereas apply passes a Series whose name attribute equals the current group key (which was my question). Interestingly, when operating on multiple columns, apply passes a DataFrame with a name attribute (which DataFrames normally lack) set to the group name. So this pattern also works when aggregating multiple columns at once.
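To make the contrast concrete, here is a minimal sketch that records the name attribute each method hands to the function. The sample data and the column names power, nt and width are made up for illustration, and the behaviour may vary between pandas versions, so treat it as a check to run rather than a guarantee.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "power": [10, 10, 20, 20],
    "nt": [5, 7, 9, 11],
    "width": [1.0, 2.0, 3.0, 4.0],
})

names_seen_by_agg = []
names_seen_by_apply = []
frame_names_seen_by_apply = []

def record_agg(x):
    # Under agg, x is the Series for one group of one column.
    names_seen_by_agg.append(x.name)
    return x.sum()

def record_apply(x):
    # Under apply on a single column, x is the Series for one group.
    names_seen_by_apply.append(x.name)
    return x.sum()

def record_apply_frame(g):
    # Under apply on several columns, g is the DataFrame for one group.
    frame_names_seen_by_apply.append(g.name)
    return g["nt"].sum() + g["width"].sum()

grouped = df.groupby("power")
grouped["nt"].agg(record_agg)
grouped["nt"].apply(record_apply)
grouped[["nt", "width"]].apply(record_apply_frame)

# Collapse to unique values, since some versions may call the function more than once.
agg_names = sorted(set(names_seen_by_agg))
apply_names = sorted(int(k) for k in set(names_seen_by_apply))
frame_names = sorted(int(k) for k in set(frame_names_seen_by_apply))
print(agg_names, apply_names, frame_names)
```

With agg the recorded name should be the column name, while both apply variants should record the group keys. Selecting the columns explicitly before apply keeps the grouping column out of the frame passed to the function.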

For more info, see "What is the difference between pandas agg and apply function?"

If you use groupby + apply, the group name is available through the .name attribute:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [1, 1, 2, 2]})

def foo(g):
    # g.name holds the key of the group currently being processed
    print('at group %s' % g.name)
    return int(g.name) + g.sum()

>>> df.b.groupby(df.a).apply(foo)
at group 1
at group 2
a
1    4
2    5
Name: b, dtype: int64
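Applied to the situation in the question, the same idea might look like the sketch below. The contents of bursts_sel and th_dict are invented stand-ins (the question does not show them), so only the pattern carries over: the threshold is looked up with x.name, and there is no need to push the grouping column into the index.

```python
import pandas as pd

# Hypothetical stand-ins for the question's bursts_sel and th_dict.
bursts_sel = pd.DataFrame({
    "power": [10, 10, 10, 20, 20, 20],
    "nt": [30, 50, 70, 40, 80, 100],
})
th_dict = {10: 40, 20: 60}  # per-group threshold, keyed by power

def tail_mean_user_th(x):
    # Under apply, x.name is the key of the current group (the power value).
    th = th_dict[x.name]
    # Mean of the values above the threshold, measured from the threshold.
    return x.loc[x > th].mean() - th

mbsize = bursts_sel.groupby("power")["nt"].apply(tail_mean_user_th)
print(mbsize)
```

For this made-up data the result is 20.0 for power 10 (mean of 50 and 70, minus 40) and 30.0 for power 20 (mean of 80 and 100, minus 60).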
