
Increase efficiency of pandas groupby with custom aggregation function

I have a moderately sized dataframe (somewhere in the 2000x10000 range in terms of shape).

I am trying to group by a column and average the first N non-null entries, e.g.:

import numpy as np

def my_part_of_interest(v, N=42):
    # drop the NaNs, then average the first N remaining values
    valid = v[~np.isnan(v)]
    return np.mean(valid.values[0:N])

mydf.groupby('key').agg(my_part_of_interest)

This now takes a long time (dozens of minutes), whereas .agg(np.nanmean) ran in a matter of seconds.

How can I get it running faster?
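For reference, this is roughly how I compare the two variants (just a timing sketch; mydf and my_part_of_interest are as defined above):

import time
import numpy as np

# time the fast baseline aggregation
t0 = time.perf_counter()
baseline = mydf.groupby('key').agg(np.nanmean)
print('np.nanmean agg: %.1f s' % (time.perf_counter() - t0))

# time the custom aggregation
t0 = time.perf_counter()
custom = mydf.groupby('key').agg(my_part_of_interest)
print('custom agg:     %.1f s' % (time.perf_counter() - t0))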

Some things to consider:

  1. Dropping the NaN entries on the entire dataframe in a single operation is faster than doing it per group: mydf.dropna(subset=['v'], inplace=True)
  2. Use .head to slice each group: mydf.groupby('key').apply(lambda x: x.head(42).agg('mean'))

I think combining those can speed things up a bit, and they are more idiomatic pandas; a combined sketch follows.
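Put together, a minimal sketch might look like this (it assumes the value column is literally named 'v', as in point 1, and mydf and 'key' come from the question):

# drop the NaN rows once, on the whole frame (keeps a clean copy instead of inplace)
clean = mydf.dropna(subset=['v'])

# first 42 rows of each group, then the mean, via a light apply
result = clean.groupby('key')['v'].apply(lambda s: s.head(42).mean())

# same result without apply: slice every group at once, then a single groupby-mean
result_fast = clean.groupby('key').head(42).groupby('key')['v'].mean()

The last form avoids a Python-level function call per group, which is usually where the minutes go with a custom aggregation.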
