Speed up apply in Pandas
東風:
Person,utility,selected,innovation
2012001153_7_E02005533_1_2012002698,130.2333,yes,0
2012001153_7_E02005533_1_2012002698,110.33,no,1
2012001153_7_E02005533_1_2012002698,83,no,2
2012001153_7_E02005533_1_2012002698,-100,no,3
2012001153_7_E02005533_1_2012002698,49,no,4
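I want to create a new column holding each row's disutility relative to the selected row (i.e. the row where selected == "yes").
The following approach works, but it is slow when run over millions of records: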
def get_relativeUtilityToSelected(group):
    # Utility of the row in this group where selected == 'yes'
    selected_utility = group.loc[group['selected'] == 'yes', 'utility'].iloc[0]
    group['relativeDisUtilityToSelected'] = group['utility'] - selected_utility
    return group

# The column is named 'Person' in the data above
df = df.groupby(['Person']).apply(get_relativeUtilityToSelected)
Expected output:
Person,utility,selected,innovation,relativeDisUtilityToSelected
2012001153_7_E02005533_1_2012002698,130.2333,yes,0,0
2012001153_7_E02005533_1_2012002698,110.33,no,1,-19.9033
2012001153_7_E02005533_1_2012002698,83,no,2,-47.2333
2012001153_7_E02005533_1_2012002698,-100,no,3,-230.2333
2012001153_7_E02005533_1_2012002698,49,no,4,-81.2333
How can I speed this up?
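If speed is a problem, avoid .apply altogether: groupby(...).apply calls your Python function once per group, which amounts to looping over your DataFrame.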
Note: from your function, I infer that each 'person' has exactly one row where 'selected' == 'yes'.
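Since everything below relies on that assumption, it may be worth checking it first. A minimal sketch, using a small hypothetical frame (the people 'A' and 'B' and their values are invented for illustration, not taken from the question's data):

```python
import pandas as pd

# Hypothetical sample data; 'A' and 'B' are made-up identifiers.
df = pd.DataFrame({
    'person':   ['A', 'A', 'A', 'B', 'B'],
    'utility':  [130.2333, 110.33, 83.0, 10.0, 12.5],
    'selected': ['yes', 'no', 'no', 'no', 'yes'],
})

# Count the rows with selected == 'yes' for each person.
yes_per_person = df['selected'].eq('yes').groupby(df['person']).sum()

# Any person listed here breaks the "exactly one selected row" assumption.
violations = yes_per_person[yes_per_person != 1]
print(violations)
```

If `violations` is empty, every person has exactly one selected row and the vectorized lookup is safe to use.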
This should do it:
import pandas as pd
from io import StringIO
ds = """person,utility,selected,innovation
2012001153_7_E02005533_1_2012002698,130.2333,yes,0
2012001153_7_E02005533_1_2012002698,110.33,no,1
2012001153_7_E02005533_1_2012002698,83,no,2
2012001153_7_E02005533_1_2012002698,-100,no,3
2012001153_7_E02005533_1_2012002698,49,no,4"""
df = pd.read_csv(StringIO(ds))
df['relativeDisUtility'] = df['utility'] - df[df['selected'] == 'yes'][['person', 'utility']].set_index('person').loc[df['person']].values[:, 0]
df
Result:
person ... relativeDisUtility
0 2012001153_7_E02005533_1_2012002698 ... 0.0000
1 2012001153_7_E02005533_1_2012002698 ... -19.9033
2 2012001153_7_E02005533_1_2012002698 ... -47.2333
3 2012001153_7_E02005533_1_2012002698 ... -230.2333
4 2012001153_7_E02005533_1_2012002698 ... -81.2333
Breakdown:
df[
    df['selected'] == 'yes'    # pick the rows where 'selected' == 'yes'
][
    ['person', 'utility']      # choose 'person' and 'utility' columns
].set_index(
    'person'                   # make 'person' the index
).loc[
    df['person']               # expand to the shape of the original 'person' column
].values[:, 0]                 # get the values as a 1-D array
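The same lookup can also be written with Series.map, which some may find easier to read. This is a sketch of an alternative rather than the approach above, and it relies on the same assumption of exactly one 'yes' row per person (map raises an error if the lookup index contains duplicates). The name `selected_utility` is introduced here for illustration.

```python
import pandas as pd
from io import StringIO

ds = """person,utility,selected,innovation
2012001153_7_E02005533_1_2012002698,130.2333,yes,0
2012001153_7_E02005533_1_2012002698,110.33,no,1
2012001153_7_E02005533_1_2012002698,83,no,2
2012001153_7_E02005533_1_2012002698,-100,no,3
2012001153_7_E02005533_1_2012002698,49,no,4"""
df = pd.read_csv(StringIO(ds))

# One utility value per person, indexed by person
# (assumes exactly one selected == 'yes' row per person).
selected_utility = df.loc[df['selected'] == 'yes'].set_index('person')['utility']

# Look up each row's selected utility by person, then subtract.
df['relativeDisUtility'] = df['utility'] - df['person'].map(selected_utility)
print(df[['utility', 'selected', 'relativeDisUtility']])
```

Both versions avoid calling a Python function per group; which one is faster on a given dataset is something to measure rather than assume.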