pandas apply function performance optimization
I have the following df:
activity | region | empPeople |
---|---|---|
12122 | 1101 | 2 |
23322 | 1233 | 40 |
22223 | 2323 | 0 |
... | ... | ... |
I want to create a column RCA that takes the value 1 if (empPeople / TotalEmpRegion) / (totalEmpActivity / totalEmp) > 1 and 0 otherwise. Then I will transform this df into a pivot table with index=region, columns=activity, and values=rca.
I wrote the following function:
def rca_emp(activity: str, region: str, emp: float):
    top = emp / df[df['region'] == region].empPeople.sum()
    bottom = df[df['activity'] == activity].empPeople.sum() / df.empPeople.sum()
    rca = top / bottom
    if rca > 1:
        return 1
    else:
        return 0
Then I used the apply method to create the rca column:
# finding RCA
df['rca'] = df.apply(lambda x : rca_emp(activity=x['activity'] , region=x['region'] , emp=x['empPeople']) , axis=1)
# create a binary matrix
df.pivot(index='region', columns='activity', values='rca')
The issue is that apply takes too much time (6047 seconds). Is there a faster way to accomplish this task?
Instead of your function, use GroupBy.transform with sum, and create the 0/1 values with numpy.where:
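import numpy as np

# s1 / s2: total empPeople of each row's activity / region, aligned to df's index
s1 = df.groupby('activity')['empPeople'].transform('sum')
s2 = df.groupby('region')['empPeople'].transform('sum')
df['rca'] = np.where((df['empPeople'] / s2) / (s1 / df.empPeople.sum()) > 1, 1, 0)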
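As a rough end-to-end sketch, the same idea can be carried through to the pivot step. The sample values and the names `total_emp`, `emp_by_activity`, `emp_by_region`, `ratio`, and `rca_matrix` below are illustrative assumptions, not taken from the question. The sample keeps each (region, activity) pair unique, since `DataFrame.pivot` raises an error on duplicate index/column pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: every (region, activity) pair appears exactly once,
# which DataFrame.pivot requires.
df = pd.DataFrame({
    "activity": [12122, 23322, 12122, 23322],
    "region": [1101, 1101, 1233, 1233],
    "empPeople": [2, 40, 10, 4],
})

# Overall total, plus per-row totals for the row's activity and region.
total_emp = df["empPeople"].sum()
emp_by_activity = df.groupby("activity")["empPeople"].transform("sum")
emp_by_region = df.groupby("region")["empPeople"].transform("sum")

# (empPeople / region total) / (activity total / overall total), computed column-wise.
ratio = (df["empPeople"] / emp_by_region) / (emp_by_activity / total_emp)
df["rca"] = np.where(ratio > 1, 1, 0)

# Binary region x activity matrix.
rca_matrix = df.pivot(index="region", columns="activity", values="rca")
print(rca_matrix)
```

The speed-up most likely comes from computing each group total once: the row-wise function re-filters the whole frame for every row, whereas `transform('sum')` aggregates per group in a single pass and broadcasts the result back to the rows.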
Testing the output:
print (df)
   activity  region  empPeople
0     12122    1101          2
1     23322    1233         40
2     22223    2323          0
3     12122    1101          1
4     23322    1233          4
5     22223    2323          6
def rca_emp(activity: str, region: str, emp: float):
    top = emp / df[df['region'] == region].empPeople.sum()
    bottom = df[df['activity'] == activity].empPeople.sum() / df.empPeople.sum()
    rca = top / bottom
    if rca > 1:
        return 1
    else:
        return 0
df['rca'] = df.apply(lambda x : rca_emp(activity=x['activity'] , region=x['region'] , emp=x['empPeople']) , axis=1)
s1 = df.groupby(['activity'])['empPeople'].transform('sum')
s2 = df.groupby(['region'])['empPeople'].transform('sum')
df['rca1'] = np.where((df['empPeople'] / s2) / (s1 / df.empPeople.sum()) > 1, 1, 0)
print (df)
   activity  region  empPeople  rca  rca1
0     12122    1101          2    1     1
1     23322    1233         40    1     1
2     22223    2323          0    0     0
3     12122    1101          1    1     1
4     23322    1233          4    0     0
5     22223    2323          6    1     1
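The comparison above covers six rows. A similar check can be sketched on a larger random frame. The data, the seed, and the helper name `rca_rowwise` are assumptions made for illustration, and `empPeople` is kept strictly positive so that no group total is zero.

```python
import numpy as np
import pandas as pd

# Hypothetical random data (not from the question); values are >= 1,
# so no region or activity total can be zero.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "activity": rng.integers(0, 20, size=n),
    "region": rng.integers(0, 10, size=n),
    "empPeople": rng.integers(1, 100, size=n),
})
total = df["empPeople"].sum()

def rca_rowwise(row, frame, total):
    """Row-by-row variant: re-filters the whole frame for every row."""
    top = row["empPeople"] / frame.loc[frame["region"] == row["region"], "empPeople"].sum()
    bottom = frame.loc[frame["activity"] == row["activity"], "empPeople"].sum() / total
    return int(top / bottom > 1)

slow = df.apply(rca_rowwise, axis=1, args=(df, total))

# Vectorized variant: group totals are computed once and aligned to the rows.
s1 = df.groupby("activity")["empPeople"].transform("sum")
s2 = df.groupby("region")["empPeople"].transform("sum")
fast = np.where((df["empPeople"] / s2) / (s1 / total) > 1, 1, 0)

# Both variants should flag exactly the same rows.
assert (slow.to_numpy() == fast).all()
```

Timing the two variants on your own data (for example with `time.perf_counter`) is the most reliable way to see how large the difference is in practice.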