Optimizing the performance of a pandas apply function

Question

I have the following DataFrame df:

activity	region	empPeople
12122	1101	2
23322	1233	40
22223	2323	0
...	...	...

I want to create a column RCA that takes the value 1 if (empPeople / totalEmpRegion) / (totalEmpActivity / totalEmp) > 1 and 0 otherwise. Here totalEmpRegion is the sum of empPeople over the row's region, totalEmpActivity is the sum over the row's activity, and totalEmp is the sum over the whole DataFrame. I will then reshape this DataFrame into a pivot table with index=region, columns=activity and values=rca.

I wrote the following function:

def rca_emp(activity:str, region:str , emp:float):
    top = emp / df[df['region'] == region].empPeople.sum()
    bottom = df[df['activity'] == activity].empPeople.sum() / df.empPeople.sum()
    rca = top/bottom
    if rca > 1: 
        return 1
    else:
        return 0

Then I used the apply method to create the rca column:

# finding RCA
df['rca'] = df.apply(lambda x : rca_emp(activity=x['activity'] , region=x['region'] , emp=x['empPeople']) , axis=1)
# create a binary matrix
df.pivot(index='region', columns='activity', values='rca')

The problem is that apply takes far too long (6047 seconds). Is there a faster way to accomplish this task?

Answer 1

Your function filters the whole DataFrame twice for every row, which is why apply is so slow. Instead of the function, use GroupBy.transform with sum to get the per-group totals aligned to each row, then build the 0/1 values with numpy.where:

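import numpy as np

# Total empPeople for each row's activity and region, aligned to the original rows
s1 = df.groupby('activity')['empPeople'].transform('sum')
s2 = df.groupby('region')['empPeople'].transform('sum')

df['rca'] = np.where((df['empPeople'] / s2) / (s1 / df['empPeople'].sum()) > 1, 1, 0)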
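
To make it clearer what transform contributes here, the following is a minimal, self-contained sketch on a six-row frame. The sample values and the names activity_total, region_total, grand_total and ratio are illustrative choices, not part of the original post.

```python
import numpy as np
import pandas as pd

# Small sample frame (illustrative values).
df = pd.DataFrame({
    "activity": [12122, 23322, 22223, 12122, 23322, 22223],
    "region": [1101, 1233, 2323, 1101, 1233, 2323],
    "empPeople": [2, 40, 0, 1, 4, 6],
})

# transform("sum") returns a Series with the same index as df:
# every row receives the total of the group it belongs to.
activity_total = df.groupby("activity")["empPeople"].transform("sum")
region_total = df.groupby("region")["empPeople"].transform("sum")
grand_total = df["empPeople"].sum()

# The whole ratio is computed column-wise, with no Python-level loop over rows.
ratio = (df["empPeople"] / region_total) / (activity_total / grand_total)
df["rca"] = np.where(ratio > 1, 1, 0)

print(df.assign(activity_total=activity_total, region_total=region_total))
```

Because each group total is computed once rather than once per row, the work grows roughly linearly with the number of rows instead of re-scanning the frame for every row.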

Testing the output:

print (df)
   activity  region  empPeople
0     12122    1101          2
1     23322    1233         40
2     22223    2323          0
3     12122    1101          1
4     23322    1233          4
5     22223    2323          6



def rca_emp(activity:str, region:str , emp:float):
    top = emp / df[df['region'] == region].empPeople.sum()
    bottom = df[df['activity'] == activity].empPeople.sum() / df.empPeople.sum()
    rca = top /bottom
    if rca > 1: 
        return 1
    else:
        return 0


df['rca'] = df.apply(lambda x : rca_emp(activity=x['activity'] , region=x['region'] , emp=x['empPeople']) , axis=1)

s1 = df.groupby(['activity'])['empPeople'].transform('sum')
s2 = df.groupby(['region'])['empPeople'].transform('sum')

df['rca1'] = np.where((df['empPeople'] / s2)  / (s1 / df.empPeople.sum())  > 1, 1, 0)
print (df)
   activity  region  empPeople  rca  rca1
0     12122    1101          2    1     1
1     23322    1233         40    1     1
2     22223    2323          0    0     0
3     12122    1101          1    1     1
4     23322    1233          4    0     0
5     22223    2323          6    1     1
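
For the final pivot step from the question, one point may need care: DataFrame.pivot raises a ValueError when the same (region, activity) pair appears more than once, as it does in the sample frame above. The sketch below assumes that duplicate pairs should be summed into a single row before RCA is computed, and that pairs absent from the data count as 0. Both are assumptions about the intended result, and the names agg and matrix are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "activity": [12122, 23322, 22223, 12122, 23322, 22223],
    "region": [1101, 1233, 2323, 1101, 1233, 2323],
    "empPeople": [2, 40, 0, 1, 4, 6],
})

# Assumption: collapse duplicates to one row per (region, activity) pair.
agg = df.groupby(["region", "activity"], as_index=False)["empPeople"].sum()

# Same vectorized RCA computation, now on the aggregated frame.
s1 = agg.groupby("activity")["empPeople"].transform("sum")
s2 = agg.groupby("region")["empPeople"].transform("sum")
agg["rca"] = np.where(
    (agg["empPeople"] / s2) / (s1 / agg["empPeople"].sum()) > 1, 1, 0
)

# Pairs are unique now, so pivot works; missing pairs become 0 (assumption).
matrix = (
    agg.pivot(index="region", columns="activity", values="rca")
    .fillna(0)
    .astype(int)
)
print(matrix)
```

If every (region, activity) pair is already unique in the real data, the aggregation step is unnecessary and df.pivot can be called directly as in the question.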

Answer 1: accepted, 2022-09-12 10:25:07
