
Annotate each row with percent of total for group by, in pandas?

I have a DataFrame that looks like this:

Company       Speciality      Payment
AcmeCorp      Roofing         50.00
AcmeCorp      Grounding       50.00
LolCorp       Roofing         106.00
LolCorp       Grounding       94.00

I'd like to add a percentage column like this:

Company       Speciality      Payment     Percent of Total Payment
AcmeCorp      Roofing         50.00       50
AcmeCorp      Grounding       50.00       50
LolCorp       Roofing         106.00      53
LolCorp       Grounding       94.00       47

What's the best way to do this?

I could do it messily using something like this:

df_m = df.groupby('Company').sum()
final_df = pd.merge(df, df_m, on='Company', suffixes=('_Raw', '_Total'))
final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']

But I wonder if there's a more efficient way.

Use groupby/transform to produce a column of the same length as the original DataFrame. Because the result is aligned with the original rows, you can avoid calling pd.merge.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
 'Payment': [50.0, 50.0, 106, 94.00],
 'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
print(df)

yields

    Company  Payment Speciality  percent
0  AcmeCorp     50.0    Roofing     0.50
1  AcmeCorp     50.0  Grounding     0.50
2   LolCorp    106.0    Roofing     0.53
3   LolCorp     94.0  Grounding     0.47
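
The question's target column is on a 0-100 scale, whereas the ratio above is a fraction. Here is a minimal sketch of the conversion. The column name `Percent of Total Payment` is taken from the question; rounding to whole numbers is an assumption, based only on the integer values shown there.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

# Per-row group total, aligned with df's index
total = df.groupby('Company')['Payment'].transform('sum')

# Scale the fraction to a percentage; the rounding is an illustrative choice
df['Percent of Total Payment'] = (100 * df['Payment'] / total).round(0)
print(df)
```

Drop the `.round(0)` call if fractional percentages should be kept.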

Although

total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total

could be reduced to a one-liner,

df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())

because builtin operations like .transform('sum') are faster than those with custom functions (e.g. .transform(lambda x: x/x.sum())), the two-line version is faster, particularly for large DataFrames.

And, of course, the two-line version could also be written as

df['percent'] = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')

with no loss in speed and one fewer named variable, though it is perhaps a bit harder to read.
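
As a quick sanity check, the three formulations can be compared on the small example frame. This is only a sketch: the variable names `two_line`, `one_liner`, and `inline` are made up for illustration, and `check_names=False` is passed so that only the values and index are compared.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

# Two-line version: builtin 'sum' transform, then divide
total = df.groupby('Company')['Payment'].transform('sum')
two_line = df['Payment'] / total

# One-liner with a custom function applied per group
one_liner = df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

# Two-line version collapsed into a single expression
inline = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')

# Each call raises AssertionError if the values or index differ
pd.testing.assert_series_equal(two_line, one_liner, check_names=False)
pd.testing.assert_series_equal(two_line, inline, check_names=False)
```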


Here's a benchmark on a 100K-row DataFrame:

In [53]: %timeit using_transform(df)
100 loops, best of 3: 8.5 ms per loop

In [54]: %timeit using_one_liner(df)
10 loops, best of 3: 20.2 ms per loop

In [55]: %timeit orig(df)
10 loops, best of 3: 30.2 ms per loop

This is the setup used to perform the benchmark.

import numpy as np
import pandas as pd

N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
    'Payment': np.random.randint(10, size=N),
    'Speciality': np.random.choice(list('XYZ'), size=N)})

def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    df['percent'] = df['Payment']/total
    return df

def using_one_liner(df):
    df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
    return df

def orig(df):
    df_m = df.groupby('Company').sum()
    final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
    final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
    return final_df
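
The `%timeit` lines are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. The helper `time_ms` and its repeat counts are assumptions for illustration. The two functions here return the ratio instead of assigning a column, so repeated calls do not modify `df`. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
                   'Payment': np.random.randint(10, size=N),
                   'Speciality': np.random.choice(list('XYZ'), size=N)})

def using_transform(df):
    # Builtin 'sum' transform, then a vectorized division
    total = df.groupby('Company')['Payment'].transform('sum')
    return df['Payment'] / total

def using_one_liner(df):
    # Custom function called once per group
    return df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

def time_ms(func, number=10, repeat=3):
    # Hypothetical helper: best-of-`repeat` average time per call, in milliseconds
    best = min(timeit.repeat(lambda: func(df), number=number, repeat=repeat))
    return 1000 * best / number

for func in (using_transform, using_one_liner):
    print(f'{func.__name__}: {time_ms(func):.2f} ms per call')
```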
