python pandas: vectorized time series window function
I have a pandas dataframe in the following format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I have written the following to create two new fields indicating 30-day windows:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
    # index of the first 30-day period whose end lies after x
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
    # index of the first 30-day period whose start is not before x
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
Unfortunately, the row-wise apply is far too slow for my application. I would greatly appreciate any tips on vectorizing these functions, if possible.
EDIT:
The resultant dataframe should have this layout:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
It does not need to be resampled or windowed in the formal sense. It just needs the 'window_start_dt' and 'window_end_dt' columns to be added. The current code works; it just needs to be vectorized if possible.
EDIT 2: pandas.cut is built in:
tt=[[1,'2004-01-02',0.1,25,47],
[1,'2004-01-17',0.2,150,8],
[2,'2004-01-29',0.2,150,25],
[3,'2017-07-15',0.3,55,17],
[3,'2016-05-12',0.3,55,47],
[4,'2012-02-23',0.2,150,22],
[4,'2009-10-10',0.1,25,12],
[4,'2014-04-04',0.2,150,2],
[5,'2008-07-09',0.2,150,43]]
start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')
df = pd.DataFrame(tt,columns=['customer_id','transaction_dt','product','price','units'])
df['transaction_dt'] = pd.Series([pd.to_datetime(sub_t[1],format='%Y-%m-%d') for sub_t in tt])
the_cut = pd.cut(df['transaction_dt'],bins=start_date_period,right=True,labels=False,include_lowest=True)
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
print(df.head())
win_start_test and win_end_test should be equal to their counterparts computed using your function.
The ValueError was coming from not casting x to int in the relevant line. I also added a NaN check, though it wasn't needed for this toy example.
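The two list comprehensions above still loop in Python, and the `else 0` branch mixes integers into a datetime column. One possible way to avoid both is sketched below. It swaps `pd.cut` for `DatetimeIndex.searchsorted`, assumes calendar-month windows as in the code above, and uses illustrative names (`dates`, `starts`, `ends`, `pos`) and a tiny made-up sample that are not part of the original post. Note that, unlike `pd.cut` with `right=True`, this treats a transaction on the first of a month as belonging to that month.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the last date is deliberately before the first window.
dates = pd.Series(pd.to_datetime(
    ["2004-01-02", "2004-01-17", "2016-05-12", "2003-06-01"]
))

# Month-start boundaries, and the matching month-end for each one.
starts = pd.date_range("2004-01-01", "2017-12-01", freq="MS")
ends = starts + pd.offsets.MonthEnd(0)

# Position of the last month start that is <= each date (-1 if none).
pos = starts.searchsorted(dates.to_numpy(), side="right") - 1

# Rows outside the covered range get NaT instead of 0.
in_range = (pos >= 0) & (dates <= ends[-1]).to_numpy()
safe_pos = np.where(in_range, pos, 0)

win_start = pd.Series(starts[safe_pos].to_numpy(), index=dates.index).where(in_range)
win_end = pd.Series(ends[safe_pos].to_numpy(), index=dates.index).where(in_range)

print(pd.DataFrame({"dt": dates, "win_start": win_start, "win_end": win_end}))
```

Because the lookup is a single fancy-indexing step, no per-row `int` cast or `isnan` check is needed.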
Note the change to pd.date_range and the use of the start-of-month and end-of-month flags MS and M, as well as converting the date strings into datetime.
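If you need the fixed 30-day windows from your original functions rather than calendar months, plain integer arithmetic on the day offsets may be enough. The following is only a sketch under some assumptions: transaction_dt holds date-only (midnight) timestamps, the windows are anchored at 2004-01-01 as in the question, and `origin`, `window` and the small sample frame are illustrative names that do not appear in the original post.

```python
import pandas as pd

# Hypothetical sample with a datetime64 transaction_dt column.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "transaction_dt": pd.to_datetime(
        ["2004-01-02", "2004-01-17", "2004-01-31", "2017-07-15"]
    ),
})

origin = pd.Timestamp("2004-01-01")  # assumed anchor of the first window
window = 30                          # window length in days

# Whole days since the anchor, then the index of the 30-day window.
offset_days = (df["transaction_dt"] - origin).dt.days
window_idx = offset_days // window

# Window start is the anchor plus a whole number of windows;
# window end is the last day of that window (start + 29 days).
df["window_start_dt"] = origin + pd.to_timedelta(window_idx * window, unit="D")
df["window_end_dt"] = df["window_start_dt"] + pd.Timedelta(days=window - 1)

print(df)
```

Each window_end_dt is 29 days after its window_start_dt, which is consistent with the first window in the question running from 2004-01-01 to 2004-01-30. Dates before the anchor would get a negative window index, so they should be filtered out or handled separately.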