
Increase efficiency of pandas groupby with custom aggregation function

I have a moderately sized dataframe (somewhere in the 2000x10000 range in terms of shape).

I am trying to group by a column and average the first N non-null entries, e.g.:

import numpy as np

def my_part_of_interest(v, N=42):
    # drop the NaNs, then average the first N remaining values
    valid = v[~np.isnan(v)]
    return np.mean(valid.values[0:N])

mydf.groupby('key').agg(my_part_of_interest)

This now takes a long time (dozens of minutes), whereas .agg(np.nanmean) ran in a matter of seconds.

How can I get it running faster?
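For reference, this is roughly how I compare the two variants (just a timing sketch; mydf and my_part_of_interest are as defined above):

import time
import numpy as np

# time the fast baseline aggregation
t0 = time.perf_counter()
baseline = mydf.groupby('key').agg(np.nanmean)
print('np.nanmean agg: %.1f s' % (time.perf_counter() - t0))

# time the custom aggregation
t0 = time.perf_counter()
custom = mydf.groupby('key').agg(my_part_of_interest)
print('custom agg:     %.1f s' % (time.perf_counter() - t0))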

Some things to consider:

  1. Dropping the NaN entries on the entire dataframe in a single operation is faster than doing it per group: mydf.dropna(subset=['v'], inplace=True)
  2. Use .head to slice each group: mydf.groupby('key').apply(lambda x: x.head(42).agg('mean'))

I think combining those can speed things up a bit, and they are more idiomatic pandas; a combined sketch follows.
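Put together, a minimal sketch might look like this (it assumes the value column is literally named 'v', as in point 1, and mydf and 'key' come from the question):

# drop the NaN rows once, on the whole frame (keeps a clean copy instead of inplace)
clean = mydf.dropna(subset=['v'])

# first 42 rows of each group, then the mean, via a light apply
result = clean.groupby('key')['v'].apply(lambda s: s.head(42).mean())

# same result without apply: slice every group at once, then a single groupby-mean
result_fast = clean.groupby('key').head(42).groupby('key')['v'].mean()

The last form avoids a Python-level function call per group, which is usually where the minutes go with a custom aggregation.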
