I have the following code to compute the conversion rate by age (the conversion column has two values: 1 represents a successful conversion and 0 a failure). Is there a more "elegant" way to do this?
import pandas as pd
import numpy as np
np.random.seed(30)
### MAKE PSEUDODATA
start_date,end_date = '1/1/2015','12/31/2018'
date_rng = pd.date_range(start= start_date, end=end_date, freq='D')
length_of_field = date_rng.shape[0]
df = pd.DataFrame(date_rng, columns=['date'])
df['age'] = np.random.randint(18,100,size=(len(date_rng)))
df['conversion'] = np.random.randint(0,2,size=(len(date_rng)))
### ACTUAL CONVERSION CALCULATION
conversion_by_age = df.groupby(by='age')['conversion'].agg(['count','sum'])
conversion_by_age['rate'] = df.groupby(by='age')['conversion'].sum()/df.groupby(by='age')['conversion'].count()
print(conversion_by_age)
There's no need to perform the groupby again once it has been defined. I would also use div instead of the / operator for Series/DataFrame division. I would change the last two lines as follows and obtain the same results:
conversion_by_age['rate'] = conversion_by_age['sum'].div(conversion_by_age['count'])
print(conversion_by_age)
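To check that the rewrite really gives the same numbers, here is a minimal self-contained sketch. It rebuilds pseudodata along the lines of the question's and compares the div version with the original expression. The names original_rate and same are only illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

# Rebuild pseudodata along the lines of the question (same seed and ranges).
np.random.seed(30)
dates = pd.date_range(start="1/1/2015", end="12/31/2018", freq="D")
df = pd.DataFrame({"date": dates})
df["age"] = np.random.randint(18, 100, size=len(dates))
df["conversion"] = np.random.randint(0, 2, size=len(dates))

# Group once, then reuse the aggregated columns for the rate.
conversion_by_age = df.groupby("age")["conversion"].agg(["count", "sum"])
conversion_by_age["rate"] = conversion_by_age["sum"].div(conversion_by_age["count"])

# The question's original expression, recomputed for comparison only.
original_rate = (
    df.groupby("age")["conversion"].sum() / df.groupby("age")["conversion"].count()
)

# Both Series are indexed by age, so an element-wise comparison is meaningful.
same = np.allclose(conversion_by_age["rate"], original_rate)
print(same)
```

For this kind of division div and / behave the same way. div is mainly useful when you want its extra arguments, such as fill_value.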
Another method takes only one line of code: the rate can be calculated within the groupby by using a lambda:
conversion_by_age = df.groupby(by='age').apply(lambda x: x['conversion'].sum() / x['conversion'].count())
Finally, even though the lambda version is a one-liner, it is substantially slower than using .div(). These are the times for 1000 runs:
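One way to reproduce such a comparison is with the standard-library timeit module. The sketch below is an assumption about how the measurement could be set up, not the original benchmark. The function names, the number of runs and the random data are all illustrative, and the apply variant selects the conversion column before calling apply, which is a slight variation on the one-liner above.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data of about the same size as the question's (not identical values).
rng = np.random.default_rng(30)
n = 1461
df = pd.DataFrame({
    "age": rng.integers(18, 100, size=n),
    "conversion": rng.integers(0, 2, size=n),
})

def rate_with_apply():
    # Calls a Python lambda once per age group.
    return df.groupby("age")["conversion"].apply(lambda s: s.sum() / s.count())

def rate_with_div():
    # Aggregates once, then divides the two resulting columns.
    agg = df.groupby("age")["conversion"].agg(["count", "sum"])
    return agg["sum"].div(agg["count"])

runs = 20  # kept small here; raise it for more stable numbers
t_apply = timeit.timeit(rate_with_apply, number=runs)
t_div = timeit.timeit(rate_with_div, number=runs)

print(f"apply + lambda: {t_apply:.4f} s for {runs} runs")
print(f"agg + div:      {t_div:.4f} s for {runs} runs")
```

Absolute times depend on the machine and the pandas version, so only the relative difference between the two is worth reading.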
'Named aggregation' is the current best practice in pandas (requires pandas 0.25.0 or later). However, you still need to calculate 'rate' in a second line, as a one-liner built on apply with a lambda would use non-vectorised pandas operations and be much slower. Overall, I'd suggest:
conversion_by_age = df.groupby(by='age').agg(**{'conversion_count':('conversion','count'), 'conversion_sum':('conversion','sum')})
conversion_by_age['rate'] = conversion_by_age['conversion_sum'].div(conversion_by_age['conversion_count'])
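Because conversion only holds 0 and 1, its mean is equal to its sum divided by its count, so one possible variation is to ask for the mean in the same named aggregation. This relies on that assumption about the data and is not something the answers above propose. The tiny DataFrame and the rate column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hand-made example: three age groups with 0/1 conversion flags.
df = pd.DataFrame({
    "age": [25, 25, 25, 40, 40, 63],
    "conversion": [1, 0, 1, 0, 0, 1],
})

# Named aggregation: each keyword is an output column, each value is
# a (source column, aggregation function) pair.
conversion_by_age = df.groupby("age").agg(
    conversion_count=("conversion", "count"),
    conversion_sum=("conversion", "sum"),
    rate=("conversion", "mean"),  # equals sum / count for a 0/1 column
)

# Cross-check against the explicit two-step calculation.
explicit_rate = conversion_by_age["conversion_sum"].div(
    conversion_by_age["conversion_count"]
)
matches = np.allclose(conversion_by_age["rate"], explicit_rate)
print(conversion_by_age)
print(matches)
```

If the column could ever hold values other than 0 and 1, the explicit sum and count version states the intent more clearly.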