Speeding up apply in Pandas

Question

DataFrame:

person,utility,selected,innovation
2012001153_7_E02005533_1_2012002698,130.2333,yes,0
2012001153_7_E02005533_1_2012002698,110.33,no,1
2012001153_7_E02005533_1_2012002698,83,no,2
2012001153_7_E02005533_1_2012002698,-100,no,3
2012001153_7_E02005533_1_2012002698,49,no,4

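I want to create a new column holding each row's disutility relative to the selected row (i.e. the row where selected == 'yes').

The following approach works, but it is slow when run on millions of records: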

def get_relativeUtilityToSelected(group):
    selected_utility = group[group['selected']=='yes']['utility'].values[0]
    group['relativeDisUtilityToSelected'] = group['utility'] - selected_utility
    return group

df = df.groupby(['person']).apply(get_relativeUtilityToSelected)

Expected output:

person,utility,selected,innovation,relativeDisUtilityToSelected
2012001153_7_E02005533_1_2012002698,130.2333,yes,0,0
2012001153_7_E02005533_1_2012002698,110.33,no,1,-19.9033
2012001153_7_E02005533_1_2012002698,83,no,2,-47.2333
2012001153_7_E02005533_1_2012002698,-100,no,3,-230.2333
2012001153_7_E02005533_1_2012002698,49,no,4,-81.2333

How can I speed this up?

Answer 1

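If speed is a concern, you should avoid .apply altogether: groupby(...).apply calls your Python function once per group, which is effectively a Python-level loop over the DataFrame.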

Note: from your function, I infer that each 'person' has exactly one row where 'selected' == 'yes'.
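That assumption can be checked before relying on it. The snippet below is only a minimal sketch on hypothetical sample data (the persons "a" and "b" are made up for illustration, not taken from the question): it counts the 'yes' rows per person and lists every person whose count is not exactly one.

```python
import pandas as pd

# Hypothetical sample data: person "a" satisfies the assumption,
# person "b" has no row with selected == 'yes'.
df = pd.DataFrame({
    "person": ["a", "a", "b", "b"],
    "utility": [1.0, 2.0, 3.0, 4.0],
    "selected": ["yes", "no", "no", "no"],
})

# Count the 'yes' rows per person (True counts as 1, False as 0).
yes_counts = df["selected"].eq("yes").groupby(df["person"]).sum()

# Persons that violate the "exactly one 'yes' row" assumption.
violations = yes_counts[yes_counts != 1]
print(violations)
```

If `violations` is empty, every person has exactly one selected row and the assumption holds.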

The following vectorized expression should do it:

import pandas as pd
from io import StringIO

ds = """person,utility,selected,innovation
2012001153_7_E02005533_1_2012002698,130.2333,yes,0
2012001153_7_E02005533_1_2012002698,110.33,no,1
2012001153_7_E02005533_1_2012002698,83,no,2
2012001153_7_E02005533_1_2012002698,-100,no,3
2012001153_7_E02005533_1_2012002698,49,no,4"""

df = pd.read_csv(StringIO(ds))

df['relativeDisUtility'] = df['utility'] - df[df['selected'] == 'yes'][['person', 'utility']].set_index('person').loc[df['person']].values[:, 0]

df

Result:

                                person  ...  relativeDisUtility
0  2012001153_7_E02005533_1_2012002698  ...              0.0000
1  2012001153_7_E02005533_1_2012002698  ...            -19.9033
2  2012001153_7_E02005533_1_2012002698  ...            -47.2333
3  2012001153_7_E02005533_1_2012002698  ...           -230.2333
4  2012001153_7_E02005533_1_2012002698  ...            -81.2333

Breakdown:

df[
    df['selected'] == 'yes'  # pick the rows where 'selected' == 'yes'
][
    ['person', 'utility']    # choose 'person' and 'utility' columns
].set_index(
    'person'                 # make 'person' the index
).loc[
    df['person']             # expand to the shape of the original 'person' column
].values[:, 0]               # get the values
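As a possible alternative to the .loc-based expansion, the same lookup can be written with Series.map. This is a sketch of a different technique, not the expression used above, and it relies on the same assumption of exactly one 'yes' row per person. The name `selected_utility` is introduced here for illustration only.

```python
import pandas as pd
from io import StringIO

ds = """person,utility,selected,innovation
2012001153_7_E02005533_1_2012002698,130.2333,yes,0
2012001153_7_E02005533_1_2012002698,110.33,no,1
2012001153_7_E02005533_1_2012002698,83,no,2
2012001153_7_E02005533_1_2012002698,-100,no,3
2012001153_7_E02005533_1_2012002698,49,no,4"""

df = pd.read_csv(StringIO(ds))

# One utility value per person, taken from the row where selected == 'yes'
# (assumes exactly one such row per person).
selected_utility = df.loc[df["selected"] == "yes"].set_index("person")["utility"]

# Look up each row's person in that Series, then subtract element-wise.
df["relativeDisUtility"] = df["utility"] - df["person"].map(selected_utility)

print(df[["utility", "selected", "relativeDisUtility"]])
```

One difference in behaviour: a person with no 'yes' row gets NaN from map, whereas the .loc lookup raises a KeyError for a missing label.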
