
python pandas: vectorized time series window function

I have a pandas DataFrame in the following format:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43

I wrote the following to create two new fields that indicate the 30-day window each transaction falls in:

import numpy as np
import pandas as pd

start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')

def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]

df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)

def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]

df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)

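Unfortunately, the row-wise apply is far too slow for my application. I would greatly appreciate any tips on vectorizing these functions, if that is possible.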

EDIT:

The resulting DataFrame should have the following layout:

'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'

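Formally, the data does not need to be resampled or windowed. It only needs the 'window_start_dt' and 'window_end_dt' columns added. The current code works; it just needs to be vectorized, if possible.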
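One way to vectorize the fixed 30-day windows is plain integer arithmetic on each date's day offset from the first window start. The following is a minimal sketch, not the original poster's code: the helper name add_30_day_windows and its origin and window_days parameters are made up for illustration, and it assumes transaction_dt holds dates without a time-of-day component. It returns Timestamps rather than Period objects.

```python
import pandas as pd

def add_30_day_windows(df, origin="2004-01-01", window_days=30):
    """Hypothetical helper: add window_start_dt / window_end_dt without apply."""
    out = df.copy()
    dt = pd.to_datetime(out["transaction_dt"])
    origin_ts = pd.Timestamp(origin)
    # Index of the fixed-length window each date falls in (0, 1, 2, ...).
    window_idx = (dt - origin_ts).dt.days // window_days
    # Window start = origin plus a whole number of windows.
    out["window_start_dt"] = origin_ts + pd.to_timedelta(window_idx * window_days, unit="D")
    # Window end = last day of the same window (start + 29 days for 30-day windows).
    out["window_end_dt"] = out["window_start_dt"] + pd.Timedelta(days=window_days - 1)
    return out

# Small made-up sample in the same shape as the question's data.
sample = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "transaction_dt": ["2004-01-02", "2004-01-17", "2004-01-31"],
})
result = add_30_day_windows(sample)
print(result)
```

Because everything is computed on whole columns, the cost no longer grows with the number of candidate windows per row, which is where the apply-based version spends its time.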

EDIT 2: pandas.cut is built in:

import numpy as np
import pandas as pd

# Toy data; the 'product' column holds placeholder numbers here.
tt = [[1, '2004-01-02', 0.1, 25, 47],
      [1, '2004-01-17', 0.2, 150, 8],
      [2, '2004-01-29', 0.2, 150, 25],
      [3, '2017-07-15', 0.3, 55, 17],
      [3, '2016-05-12', 0.3, 55, 47],
      [4, '2012-02-23', 0.2, 150, 22],
      [4, '2009-10-10', 0.1, 25, 12],
      [4, '2014-04-04', 0.2, 150, 2],
      [5, '2008-07-09', 0.2, 150, 43]]

# Calendar-month bins: 'MS' = month start, 'M' = month end
# (recent pandas versions spell month end as 'ME').
start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')

df = pd.DataFrame(tt, columns=['customer_id', 'transaction_dt', 'product', 'price', 'units'])
# Convert the date strings to datetime so they can be binned.
df['transaction_dt'] = pd.to_datetime(df['transaction_dt'], format='%Y-%m-%d')

# labels=False returns the integer bin index of each date (NaN if out of range).
the_cut = pd.cut(df['transaction_dt'], bins=start_date_period, right=True, labels=False, include_lowest=True)

# Cast each bin index to int before indexing; NaN falls back to 0.
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])

print(df.head())

win_start_test and win_end_test should equal the corresponding values computed with your functions.

The ValueError was caused by not casting x to int in the relevant lines. I also added a NaN check, although this toy example does not need it.
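If out-of-range dates are a real possibility, the bin lookup can be done on whole arrays, with NaT marking the misses instead of a 0 placeholder. This is only a sketch under assumptions: the variable names are illustrative, the date 2003-06-01 is a made-up out-of-range value, and the month ends are derived with pd.offsets.MonthEnd rather than a second date_range.

```python
import pandas as pd

# Month-start bin edges, as in the pd.cut approach above.
starts = pd.date_range("2004-01-01", "2017-12-01", freq="MS")
# Matching month ends, derived from the starts (an assumption of this sketch).
ends = starts + pd.offsets.MonthEnd(1)

# Made-up dates; the second one lies before the first bin edge.
dates = pd.Series(pd.to_datetime(["2004-01-02", "2003-06-01", "2016-05-12"]))

# Integer bin index per date, NaN where the date is outside all bins.
codes = pd.cut(dates, bins=starts, right=True, labels=False, include_lowest=True)

# Use a dummy index (0) for the misses, then blank those rows out with NaT.
idx = codes.fillna(0).astype(int).to_numpy()
in_range = codes.notna()
win_start = pd.Series(starts[idx], index=dates.index).where(in_range)
win_end = pd.Series(ends[idx], index=dates.index).where(in_range)

print(pd.DataFrame({"date": dates, "win_start": win_start, "win_end": win_end}))
```

Keeping the columns as datetime with NaT, rather than mixing in integer zeros, means later date arithmetic on them does not have to special-case the out-of-range rows.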

Note the changes to pd.date_range, the month-start and month-end flags MS and M, and the conversion of the date strings to datetime.
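Since the bins are now calendar months, the same two columns can probably be produced without pd.cut at all by converting each date to a monthly period. The sketch below is an alternative, not part of the answer's code; it uses a three-row sample taken from the toy data. One difference to be aware of: a transaction dated on the first day of a month lands in that month here, whereas right=True in pd.cut places it in the preceding bin.

```python
import pandas as pd

# Three dates taken from the toy data.
df = pd.DataFrame({
    "transaction_dt": pd.to_datetime(["2004-01-02", "2016-05-12", "2012-02-23"]),
})

# Monthly period that contains each transaction date.
month = df["transaction_dt"].dt.to_period("M")

# First day of that month.
df["win_start_test"] = month.dt.start_time
# Last day of that month; normalize() drops the 23:59:59.999999999 time part.
df["win_end_test"] = month.dt.end_time.dt.normalize()

print(df)
```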

