
Speed up pandas groupby apply

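I have a DataFrame that I want to group by one column while applying many functions to it at the same time. Unfortunately, this takes far too long. I need a 10x improvement. I have read about vectorization, but with it I lose many pandas functions.

Here is my approach. First I define all the functions I need: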

def f(x):
    d = {}
    d['min_min_approved'] = x['scoring_dol_amount'][x['payment_status']=='approved'].min()
    d['max_max_approved'] = x['scoring_dol_amount'][x['payment_status']=='approved'].max()
    d['sum_approved'] = x['scoring_dol_amount'][x['payment_status']=='approved'].sum()
    d['avg_approved'] = x['scoring_dol_amount'][x['payment_status']=='approved'].mean()
    d['std_approved'] = x['scoring_dol_amount'][x['payment_status']=='approved'].std()
    d['sum_approved_tpn'] = x['scoring_dol_amount'][x['payment_status']=='approved'].count()
    d['sum_rejected_tpn'] = x['scoring_dol_amount'][x['payment_status']=='rejected'].count()
    d['sum_rejected_tpn_hr'] = x['scoring_dol_amount'][x['payment_status_detail']=='cc_rejected_high_risk'].count()
    d['sum_rejected'] = x['scoring_dol_amount'][x['payment_status']=='rejected'].sum()
    d['sum_rejected_hr'] = x['scoring_dol_amount'][x['payment_status_detail']=='cc_rejected_high_risk'].sum()
    d['avg_rejected'] = x['scoring_dol_amount'][x['payment_status']=='rejected'].mean()
    d['std_rejected'] = x['scoring_dol_amount'][x['payment_status']=='rejected'].std()
    d['sum_late_hours'] = x['scoring_dol_amount'][(x['payment_date_created'].dt.hour >=23) | (x['payment_date_created'].dt.hour <=6)].count()
    #d['ratio_receive'] = (x['scoring_dol_amount'][x['payment_status']=='approved'].sum())/(x['scoring_dol_amount'][x['payment_status']=='rejected'].sum()+x['scoring_dol_amount'][x['payment_status']=='approved'].sum())
    #d['ratio_receive_tpn'] = (x['scoring_dol_amount'][x['payment_status']=='approved'].count())/(x['scoring_dol_amount'][x['payment_status']=='rejected'].count()+x['scoring_dol_amount'][x['payment_status']=='approved'].count())
    #d['distinct_tc']= x['tc'].nunique()
    #d['distinct_doc']= x['payer_identification_number'].nunique()
    #d['ratio_tc']= (x['tc'].nunique())/(x['scoring_dol_amount'][x['payment_status']=='approved'].count())
    #d['ratio_doc']= (x['payer_identification_number'].nunique())/(x['scoring_dol_amount'][x['payment_status']=='approved'].count())

    return pd.Series(d, index=['min_min_approved', 'max_max_approved', 'sum_approved', 'avg_approved','std_approved','sum_approved_tpn','sum_rejected_tpn','sum_rejected_tpn_hr','sum_rejected','sum_rejected_hr','avg_rejected','std_rejected','sum_late_hours'])#,'ratio_receive','ratio_receive_tpn','distinct_tc','distinct_doc','ratio_tc','ratio_doc'])

I apply it this way:

dataset_recibido=dataset_recibido.set_index('cust_id')
dataset_recibido.groupby(dataset_recibido.index).apply(f)

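How can I speed this up?

It looks like you have built something that pandas already provides. Just groupby() on cust_id and the payment_status column you are currently filtering on, then use agg():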

dataset_recibido.groupby(['cust_id','payment_status'])\
                          .agg(['count','mean','std','sum','min','max'])
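
The grouped result has one row per (cust_id, payment_status) pair, whereas the question's f returns one row per customer. A minimal sketch of one way to get back to that shape is below. The toy data and the names df, stats and wide are hypothetical, and the selection of scoring_dol_amount before agg() is an assumption added here so that non-numeric columns are not aggregated.

```python
import pandas as pd

# Hypothetical toy data; the column names follow the question.
df = pd.DataFrame({
    "cust_id": [1, 1, 1, 2, 2],
    "payment_status": ["approved", "approved", "rejected", "approved", "rejected"],
    "scoring_dol_amount": [10.0, 30.0, 5.0, 7.0, 3.0],
})

# Built-in aggregations, computed once per (customer, status) pair.
stats = (
    df.groupby(["cust_id", "payment_status"])["scoring_dol_amount"]
      .agg(["count", "mean", "std", "sum", "min", "max"])
)

# Move payment_status from the row index into the columns,
# which leaves one row per customer.
wide = stats.unstack("payment_status")

# Flatten the (statistic, status) column pairs into names like "sum_approved".
wide.columns = [f"{stat}_{status}" for stat, status in wide.columns]

print(wide)
```

Combinations that never occur for a customer show up as NaN after unstack(), so counts and sums may need a fillna(0) if zeros are expected there.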

Built-in functions are faster than a custom apply. In your case, you can use three separate groupby calls, keyed on payment_status, payment_status_detail and payment_date_created:

group1 = x.groupby(["cust_id", "payment_status"])
stats1 = group1['scoring_dol_amount'].agg(["mean", "std", "sum", "min", "max", "count"])

group2 = x.groupby(["cust_id", "payment_status_detail"])
stats2 = group2['scoring_dol_amount'].agg(["sum", "count"])

group3 = x.groupby(["cust_id", (x['payment_date_created'].dt.hour >=23) | (x['payment_date_created'].dt.hour <=6)])
stats3 = group3['scoring_dol_amount'].count()
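
The three results above are indexed by different key pairs, so they still need to be combined into one row per customer. The following is only a sketch of how that might be done, not part of the original answer. The toy data and the name features are hypothetical, only the sum and count columns are rebuilt, and it assumes scoring_dol_amount has no missing values (so counting late rows equals counting late amounts).

```python
import pandas as pd

# Hypothetical toy data; the column names follow the question.
x = pd.DataFrame({
    "cust_id": [1, 1, 1, 2],
    "payment_status": ["approved", "rejected", "rejected", "approved"],
    "payment_status_detail": ["accredited", "cc_rejected_high_risk",
                              "cc_rejected_other", "accredited"],
    "payment_date_created": pd.to_datetime([
        "2020-01-01 23:30", "2020-01-02 12:00",
        "2020-01-02 03:00", "2020-01-03 10:00",
    ]),
    "scoring_dol_amount": [10.0, 4.0, 6.0, 8.0],
})

amount = "scoring_dol_amount"

# Sum and count per key, pivoted so that each customer is one row.
# fill_value=0 mirrors the sum/count of an empty selection in the original f.
stats1 = (x.groupby(["cust_id", "payment_status"])[amount]
            .agg(["sum", "count"]).unstack(fill_value=0))
stats2 = (x.groupby(["cust_id", "payment_status_detail"])[amount]
            .agg(["sum", "count"]).unstack(fill_value=0))

# Boolean mask for late-hour payments; summing it per customer counts them.
hour = x["payment_date_created"].dt.hour
is_late = (hour >= 23) | (hour <= 6)

features = pd.DataFrame({
    "sum_approved": stats1[("sum", "approved")],
    "sum_approved_tpn": stats1[("count", "approved")],
    "sum_rejected": stats1[("sum", "rejected")],
    "sum_rejected_tpn": stats1[("count", "rejected")],
    "sum_rejected_hr": stats2[("sum", "cc_rejected_high_risk")],
    "sum_rejected_tpn_hr": stats2[("count", "cc_rejected_high_risk")],
})
features["sum_late_hours"] = is_late.groupby(x["cust_id"]).sum()

print(features)
```

The remaining statistics (min, max, mean, std) can be pulled out of the pivoted frames the same way, but they should be pivoted without fill_value=0, since an empty selection yields NaN for them rather than zero.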
