Groupby and compute ratio in pandas

Question

I have the following code to compute the conversion rate by age (the conversion column takes two values: 1 for a successful conversion and 0 for a failure). Is there a more "elegant" way to do this?

import pandas as pd
import numpy as np

np.random.seed(30)

### MAKE PSEUDODATA
start_date, end_date = '1/1/2015', '12/31/2018'
date_rng = pd.date_range(start=start_date, end=end_date, freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
# one random age (18 to 99) and one random 0/1 conversion flag per day
df['age'] = np.random.randint(18, 100, size=len(date_rng))
df['conversion'] = np.random.randint(0, 2, size=len(date_rng))

### ACTUAL CONVERSION CALCULATION 
conversion_by_age = df.groupby(by='age')['conversion'].agg(['count','sum'])
conversion_by_age['rate'] = df.groupby(by='age')['conversion'].sum()/df.groupby(by='age')['conversion'].count()
print(conversion_by_age)

Answer 1

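There's no need to repeat the groupby once the aggregated table has been computed: the 'sum' and 'count' columns you need are already in conversion_by_age. I would also use div instead of the / operator for Series/DataFrame division (the two give the same result here). Changing the last two lines as follows produces the same output: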

conversion_by_age['rate'] = conversion_by_age['sum'].div(conversion_by_age['count'])
print(conversion_by_age)
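
As a quick check of the point above, here is a minimal sketch on a small made-up frame (the six rows are hypothetical stand-ins for the question's df). It groups once, reuses the aggregated columns, and compares the result with the question's original triple-groupby expression:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's df.
df = pd.DataFrame({
    "age": [25, 25, 25, 40, 40, 63],
    "conversion": [1, 0, 1, 0, 0, 1],
})

# Group once and keep the aggregated table.
conversion_by_age = df.groupby("age")["conversion"].agg(["count", "sum"])

# Reuse the aggregated columns; no further groupby calls are needed.
conversion_by_age["rate"] = conversion_by_age["sum"].div(conversion_by_age["count"])

# The question's original expression, kept only for comparison.
original_rate = (
    df.groupby("age")["conversion"].sum()
    / df.groupby("age")["conversion"].count()
)

print(conversion_by_age)
print(np.allclose(conversion_by_age["rate"], original_rate))
```

Both approaches align on the same age index, so the rates should match row for row.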

Alternatively, the rate can be computed in a single line inside the groupby by using a lambda. Note that this returns only the rate, as a Series indexed by age, without the count and sum columns:

conversion_by_age = df.groupby(by='age').apply(lambda x: x['conversion'].sum() / x['conversion'].count())
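
Because sum divided by count is the mean of the non-missing values, the same Series can probably be obtained with the built-in mean aggregation instead of a lambda. A minimal sketch, again on hypothetical toy data rather than the question's df:

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Hypothetical toy data standing in for the question's df.
df = pd.DataFrame({
    "age": [25, 25, 25, 40, 40, 63],
    "conversion": [1, 0, 1, 0, 0, 1],
})

grouped = df.groupby("age")["conversion"]

# Lambda version: sum / count computed per group in Python.
rate_lambda = grouped.apply(lambda s: s.sum() / s.count())

# Built-in aggregation: the mean of a 0/1 column is the conversion rate.
rate_mean = grouped.mean()

# Raises an AssertionError if the two Series differ.
assert_series_equal(rate_lambda, rate_mean)
print(rate_mean)
```

This relies on conversion holding only 0 and 1; with other encodings the mean would no longer be a rate.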

Time comparison:

Finally, even though the lambda version is a one-liner, it is slower than using .div(): roughly 1.6 times slower in the timings below, which were taken over 1000 runs:

Method 1 (.div)   Time: 0.00981671929359436 s +/- 0.0007387502003829031 s
Method 2 (lambda) Time: 0.015887546062469483 s +/- 0.0014185150269994534 s
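
The answer does not show how the timings were taken. One way to run a comparable measurement is sketched below with the standard-library timeit module. The helper names, the row count, and the number of runs are assumptions chosen for illustration, and the absolute numbers will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data of roughly the same shape as the question's df.
rng = np.random.default_rng(30)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 100, size=n),
    "conversion": rng.integers(0, 2, size=n),
})


def rate_with_div(frame):
    """Aggregate once, then divide the two aggregated columns."""
    agg = frame.groupby("age")["conversion"].agg(["count", "sum"])
    agg["rate"] = agg["sum"].div(agg["count"])
    return agg


def rate_with_lambda(frame):
    """Compute sum / count per group inside apply."""
    return frame.groupby("age")["conversion"].apply(lambda s: s.sum() / s.count())


runs = 100  # fewer than the answer's 1000, to keep the sketch quick
t_div = timeit.timeit(lambda: rate_with_div(df), number=runs) / runs
t_lambda = timeit.timeit(lambda: rate_with_lambda(df), number=runs) / runs

print(f"agg + div : {t_div:.6f} s per run")
print(f"lambda    : {t_lambda:.6f} s per run")
```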

Answer 2

'Named aggregation' (here passed as an unpacked dictionary) is the current best practice in pandas (requires pandas version 0.25.0 or later).

However, if you compute 'rate' as sum divided by count, you still need a second line: a one-liner that does the division inside apply with a lambda uses non-vectorised pandas operations and is much slower. Overall, I'd suggest:

conversion_by_age = df.groupby(by='age').agg(**{'conversion_count':('conversion','count'), 'conversion_sum':('conversion','sum')})
conversion_by_age['rate'] = conversion_by_age['conversion_sum'].div(conversion_by_age['conversion_count'])
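
One possible exception, assuming conversion holds only 0 and 1: the rate is then the mean of the column, which is a built-in aggregation, so it can be requested in the same agg call. A minimal sketch on hypothetical toy data (the column name rate and the six rows are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's df.
df = pd.DataFrame({
    "age": [25, 25, 25, 40, 40, 63],
    "conversion": [1, 0, 1, 0, 0, 1],
})

# Named aggregation with keyword arguments: output_column=(input_column, function).
conversion_by_age = df.groupby("age").agg(
    conversion_count=("conversion", "count"),
    conversion_sum=("conversion", "sum"),
    rate=("conversion", "mean"),  # mean of a 0/1 column equals sum / count
)

print(conversion_by_age)
```

Keyword arguments and the unpacked dictionary are interchangeable here; the dictionary form is mainly useful when an output column name is not a valid Python identifier.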

2 answers

Answer 1: score 4, accepted, 2020-02-07 17:46:44
Answer 2: score 0, 2020-02-07 18:28:07
