Annotate each row with percent of total for group by, in pandas?
I have a dataframe that looks like this:
Company   Speciality  Payment
AcmeCorp  Roofing     50.00
AcmeCorp  Grounding   50.00
LolCorp   Roofing     106.00
LolCorp   Grounding   94.00
I'd like to add a percentage column like this:
Company   Speciality  Payment  Percent of Total Payment
AcmeCorp  Roofing     50.00    50
AcmeCorp  Grounding   50.00    50
LolCorp   Roofing     106.00   53
LolCorp   Grounding   94.00    47
What's the best way to do this?
I could do it messily using something like this:
df_m = df.groupby('Company').sum()
final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
But I wonder if there's a more efficient way.
Use groupby/transform to produce a column of the same length as the original DataFrame. This allows you to avoid calling pd.merge.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
print(df)
yields
    Company  Payment Speciality  percent
0  AcmeCorp     50.0    Roofing     0.50
1  AcmeCorp     50.0  Grounding     0.50
2   LolCorp    106.0    Roofing     0.53
3   LolCorp     94.0  Grounding     0.47
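To see why the transformed column lines up with the original rows, it may help to compare transform against a plain aggregation. This is only a minimal sketch on the same sample data. The names `per_group` and `per_row` are made up for illustration, and multiplying by 100 is an assumption about the units the question wants (50 rather than 0.50).

```python
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

grouped = df.groupby('Company')['Payment']

# Aggregation: one total per group, indexed by Company.
per_group = grouped.sum()

# Transform: each group's total repeated on every row of that group,
# indexed like df, so it can be divided into df['Payment'] directly.
per_row = grouped.transform('sum')

print(len(per_group), len(per_row))

# Assumption: scale the fraction by 100 to match the column in the question.
df['Percent of Total Payment'] = 100 * df['Payment'] / per_row
print(df)
```

Because `per_row` shares the index of `df`, the division is aligned row by row and no merge is needed.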
Although
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
could be reduced to a one-liner,
df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
because builtin operations like .transform('sum') are faster than those with custom functions (e.g. .transform(lambda x: x/x.sum())), the two-line version is faster, particularly for large DataFrames.
And, of course, the two-line version could also be written as
df['percent'] = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')
with no loss in speed and one fewer named variable, though it is perhaps a bit harder to read.
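As a sanity check, the three spellings discussed so far should give the same numbers. A small sketch on the sample data, with illustrative variable names (`two_line`, `one_liner`, `combined`) that are not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

# Two-line version: builtin 'sum' transform, then a vectorized division.
total = df.groupby('Company')['Payment'].transform('sum')
two_line = df['Payment'] / total

# One-liner with a custom function applied group by group.
one_liner = df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

# Two-line version folded into a single expression.
combined = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')

# All three are aligned to df's index and agree numerically.
print(np.allclose(two_line, one_liner) and np.allclose(two_line, combined))
```

The results match, so the choice between them comes down to speed and readability.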
Here's a benchmark on a 100K-row DataFrame:
In [53]: %timeit using_transform(df)
100 loops, best of 3: 8.5 ms per loop
In [54]: %timeit using_one_liner(df)
10 loops, best of 3: 20.2 ms per loop
In [55]: %timeit orig(df)
10 loops, best of 3: 30.2 ms per loop
This is the setup used to perform the benchmark.
import numpy as np
import pandas as pd
N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
                   'Payment': np.random.randint(10, size=N),
                   'Speciality': np.random.choice(list('XYZ'), size=N)})

def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    df['percent'] = df['Payment']/total
    return df

def using_one_liner(df):
    df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
    return df

def orig(df):
    df_m = df.groupby('Company').sum()
    final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
    final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
    return final_df
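The %timeit lines above only work inside IPython. Outside of it, a roughly comparable measurement can be sketched with the standard-library timeit module. The helper `best_ms`, the seeded generator, and the repeat counts below are assumptions for illustration, not part of the original benchmark. The functions here return the computed Series instead of mutating the DataFrame so that repeated runs stay independent.

```python
import timeit

import numpy as np
import pandas as pd

N = 10**5
rng = np.random.default_rng(0)  # seeded only so reruns use the same data
df = pd.DataFrame({'Company': rng.choice(list('ABCD'), size=N),
                   'Payment': rng.integers(1, 10, size=N)})

def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    return df['Payment'] / total

def using_one_liner(df):
    return df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

def best_ms(func, repeat=3, number=10):
    # Hypothetical helper: best of `repeat` runs, reported as the
    # average time per call in milliseconds.
    times = timeit.repeat(lambda: func(df), repeat=repeat, number=number)
    return min(times) / number * 1000

for func in (using_transform, using_one_liner):
    print(f'{func.__name__}: {best_ms(func):.2f} ms per call')
```

Absolute timings depend on the machine and the pandas version, so only the relative ordering is worth comparing against the figures above.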