Annotate each row with percent of total for group by, in pandas?
I have a DataFrame that looks like this:
Company Speciality Payment
AcmeCorp Roofing 50.00
AcmeCorp Grounding 50.00
LolCorp Roofing 106.00
LolCorp Grounding 94.00
I would like to add a percentage column, like this:
Company Speciality Payment Percent of Total Payment
AcmeCorp Roofing 50.00 50
AcmeCorp Grounding 50.00 50
LolCorp Roofing 106.00 53
LolCorp Grounding 94.00 47
What is the best way to do this?
I can do it, messily, with:
df_m = df.groupby('Company').sum()
final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
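But I am wondering whether there is a more efficient way.
Use groupby/transform to produce a column with the same length as the original DataFrame. This avoids the call to pd.merge.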
import numpy as np
import pandas as pd
df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
print(df)
yields
Company Payment Speciality percent
0 AcmeCorp 50.0 Roofing 0.50
1 AcmeCorp 50.0 Grounding 0.50
2 LolCorp 106.0 Roofing 0.53
3 LolCorp 94.0 Grounding 0.47
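The percent column above holds fractions (0.53), while the question asked for percentages (53). A minimal sketch of the same approach scaled by 100, assuming the column name from the question is the one wanted:

```python
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

# Group totals, broadcast back so there is one value per row of df
total = df.groupby('Company')['Payment'].transform('sum')

# Scale the fraction to a percentage; the column name is taken from the question
df['Percent of Total Payment'] = 100 * df['Payment'] / total
print(df)
```

Because transform returns a result aligned with the index of df, the division lines up row by row without any merge.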
While
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
can be reduced to a one-liner,
df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
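the two-line version is faster (especially for large DataFrames), because built-in operations such as .transform('sum') are faster than transforms that call a custom function such as .transform(lambda x: x/x.sum()).
And, of course, the two-line version can also be written as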
df['percent'] = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')
with no loss in speed and one fewer named variable, though it is perhaps a little harder to read.
Here is a benchmark on a DataFrame with 100,000 rows:
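As a quick sanity check, the built-in and lambda variants can be compared directly. This is only a sketch on the small example frame; the names two_line, one_liner and same are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00]})

grouped = df.groupby('Company')['Payment']

# Built-in aggregation broadcast to every row, then one vectorized division
two_line = df['Payment'] / grouped.transform('sum')

# Custom function applied to each group separately
one_liner = grouped.transform(lambda x: x / x.sum())

# Both results are indexed like df, so they can be compared element-wise
same = np.allclose(two_line, one_liner)
print(same)
```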
In [53]: %timeit using_transform(df)
100 loops, best of 3: 8.5 ms per loop
In [54]: %timeit using_one_liner(df)
10 loops, best of 3: 20.2 ms per loop
In [55]: %timeit orig(df)
10 loops, best of 3: 30.2 ms per loop
This is the setup used to run the benchmark.
import numpy as np
import pandas as pd
N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
                   'Payment': np.random.randint(10, size=N),
                   'Speciality': np.random.choice(list('XYZ'), size=N)})
def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    df['percent'] = df['Payment']/total
    return df
def using_one_liner(df):
    df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
    return df
def orig(df):
    df_m = df.groupby('Company').sum()
    final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
    final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
    return final_df
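The %timeit lines above only work inside IPython. Outside it, a rough equivalent can be put together with the standard timeit module. The following is a sketch rather than the original benchmark: it uses a smaller frame, a seeded random generator, and simplified versions of the two functions that return the result instead of assigning a column, all of which are assumptions made here for brevity.

```python
import timeit

import numpy as np
import pandas as pd

N = 10**4  # smaller than the benchmark above, only to keep the sketch quick
rng = np.random.default_rng(0)
df = pd.DataFrame({'Company': rng.choice(list('ABCD'), size=N),
                   'Payment': rng.integers(1, 10, size=N)})

def using_transform(df):
    # built-in 'sum' broadcast to each row, then a vectorized division
    total = df.groupby('Company')['Payment'].transform('sum')
    return df['Payment'] / total

def using_one_liner(df):
    # custom function evaluated once per group
    return df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

timings = {}
for func in (using_transform, using_one_liner):
    # best of 3 repeats of 10 calls each, reported per call in milliseconds
    best = min(timeit.repeat(lambda: func(df), number=10, repeat=3))
    timings[func.__name__] = best / 10 * 1000
    print(f'{func.__name__}: {timings[func.__name__]:.2f} ms per call')
```

Absolute numbers depend on the machine and the pandas version, so they should be read only relative to each other.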