Groupby and compute ratio in pandas

I have the following code to compute the conversion rate by age (the conversion column has two values: 1 for a successful conversion and 0 for a failure). Is there a more "elegant" way to do this?

import pandas as pd
import numpy as np

np.random.seed(30)

### MAKE PSEUDODATA
start_date,end_date = '1/1/2015','12/31/2018'
date_rng = pd.date_range(start= start_date, end=end_date, freq='D')
length_of_field = date_rng.shape[0]
df = pd.DataFrame(date_rng, columns=['date'])
df['age'] = np.random.randint(18,100,size=(len(date_rng)))
df['conversion'] = np.random.randint(0,2,size=(len(date_rng)))

### ACTUAL CONVERSION CALCULATION 
conversion_by_age = df.groupby(by='age')['conversion'].agg(['count','sum'])
conversion_by_age['rate'] = df.groupby(by='age')['conversion'].sum()/df.groupby(by='age')['conversion'].count()
print(conversion_by_age)

There's no need to repeat the groupby once the aggregated frame has been computed: its count and sum columns already hold everything needed for the rate. I would also use div rather than the / operator for Series/DataFrame division. Changing the last two lines as follows gives the same results:

conversion_by_age['rate'] = conversion_by_age['sum'].div(conversion_by_age['count'])
print(conversion_by_age)
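
For reference, here is a minimal self-contained sketch of this approach, reusing the pseudodata from the question. It groups once and derives the rate from the aggregated columns; the final `head()` call is only there to keep the printed output short.

```python
import numpy as np
import pandas as pd

# Same pseudodata as in the question
np.random.seed(30)
date_rng = pd.date_range(start='1/1/2015', end='12/31/2018', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['age'] = np.random.randint(18, 100, size=len(date_rng))
df['conversion'] = np.random.randint(0, 2, size=len(date_rng))

# Group once, then compute the rate from the already aggregated columns
conversion_by_age = df.groupby(by='age')['conversion'].agg(['count', 'sum'])
conversion_by_age['rate'] = conversion_by_age['sum'].div(conversion_by_age['count'])
print(conversion_by_age.head())
```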

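Another method takes only one line of code: the rate can be calculated within the groupby by using a lambda. Note that this returns only the rate, as a Series indexed by age, rather than adding a column next to count and sum: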

conversion_by_age = df.groupby(by='age').apply(lambda x: x['conversion'].sum() / x['conversion'].count())

Time comparison:

Finally, even though the lambda version is a one-liner, it is substantially slower than using .div(). These are the times for 1000 runs:

  1. Method 1 Time: 0.00981671929359436s +/- 0.0007387502003829031
  2. Method 2 Time: 0.015887546062469483s +/- 0.0014185150269994534
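
The benchmark code itself is not shown, so the following is only a rough sketch of how such a comparison could be reproduced with `timeit`. It assumes Method 1 is the `.div()` version and Method 2 the lambda version; the helper names `rate_with_div` and `rate_with_lambda` and the run count are made up for illustration, and the absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(30)
n_rows = 1461  # number of days from 2015-01-01 to 2018-12-31
df = pd.DataFrame({
    'age': np.random.randint(18, 100, size=n_rows),
    'conversion': np.random.randint(0, 2, size=n_rows),
})

def rate_with_div(frame):
    # Hypothetical helper: one groupby, then a vectorised division
    out = frame.groupby(by='age')['conversion'].agg(['count', 'sum'])
    out['rate'] = out['sum'].div(out['count'])
    return out

def rate_with_lambda(frame):
    # Hypothetical helper: the lambda is called once per age group.
    # The column is selected before apply, a small variation on the one-liner.
    grouped = frame.groupby(by='age')['conversion']
    return grouped.apply(lambda s: s.sum() / s.count())

runs = 100  # assumed run count, smaller than the 1000 runs quoted above
t_div = timeit.timeit(lambda: rate_with_div(df), number=runs) / runs
t_lambda = timeit.timeit(lambda: rate_with_lambda(df), number=runs) / runs
print(f"div:    {t_div:.6f} s per run")
print(f"lambda: {t_lambda:.6f} s per run")
```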

'Named aggregation' is the current best practice in pandas (requires pandas version 0.25.0 or later).

However, it is still worth calculating 'rate' in a second line, since a one-liner built on apply with a lambda runs Python code once per group rather than a vectorised pandas operation, and is much slower. Overall, I'd suggest:

conversion_by_age = df.groupby(by='age').agg(conversion_count=('conversion', 'count'), conversion_sum=('conversion', 'sum'))
conversion_by_age['rate'] = conversion_by_age['conversion_sum'].div(conversion_by_age['conversion_count'])
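
As a possible alternative that is not part of the answers above: because conversion holds only 0 and 1, the group mean equals sum divided by count, so the rate could be requested as a third named aggregation. The sketch below assumes the same column names as the suggestion above and cross-checks the result against the two-step calculation.

```python
import numpy as np
import pandas as pd

np.random.seed(30)
n_rows = 1461  # same length as the question's date range
df = pd.DataFrame({
    'age': np.random.randint(18, 100, size=n_rows),
    'conversion': np.random.randint(0, 2, size=n_rows),
})

# Named aggregation: each keyword is an output column,
# each value is an (input column, aggregation) pair
conversion_by_age = df.groupby(by='age').agg(
    conversion_count=('conversion', 'count'),
    conversion_sum=('conversion', 'sum'),
    rate=('conversion', 'mean'),  # mean of 0/1 values equals sum / count
)

# Cross-check against the two-step calculation
two_step = conversion_by_age['conversion_sum'].div(conversion_by_age['conversion_count'])
print(np.allclose(conversion_by_age['rate'], two_step))
print(conversion_by_age.head())
```

Whether this reads better than the explicit division is a matter of taste. It only works here because the column is a 0/1 indicator.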
