Improving performance of datetime comparison in pandas
I have a pandas DataFrame with a column that looks like this:
df['ORDER_RECEIVED_DATE'].head()
Out[91]:
0 2018-01-01
1 2018-01-01
2 2018-01-01
3 2018-01-01
4 2018-01-01
I am defining a custom function to create another column, "Period", by comparing the date values in "ORDER_RECEIVED_DATE" against a series of thresholds:
def period(x):
    if x<pd.to_datetime('01-04-2018'):
        y=0
    elif x<pd.to_datetime('04-05-2018'):
        y=1
    elif x<pd.to_datetime('05-31-2018'):
        y=2
    elif x<pd.to_datetime('08-02-2018'):
        y=3
    elif x<pd.to_datetime('09-27-2018'):
        y=4
    elif x<pd.to_datetime('01-03-2019'):
        y=5
    elif x<pd.to_datetime('02-14-2019'):
        y=6
    elif x<pd.to_datetime('03-28-2019'):
        y=7
    else:
        y=8
    return y
Finally, I apply it to the column:
df['Period'] = df['ORDER_RECEIVED_DATE'].apply(lambda x:period(x))
But it's incredibly slow on ~1M records. How can I speed it up?
Try pd.cut:
old_date = '01-01-1970'
future_date = '01-01-2050'
cuts = pd.to_datetime([old_date, '01-04-2018', '04-05-2018', '05-31-2018',
                       '08-02-2018', '09-27-2018', '01-03-2019',
                       '02-14-2019', '03-28-2019', future_date])
df = pd.DataFrame({'date': pd.date_range('01-01-2018', '04-05-2019', freq='MS')})
df['ped'] = pd.cut(df['date'], bins=cuts).cat.codes
Output:
+----+---------------------+-------+
| | date | ped |
|----+---------------------+-------|
| 0 | 2018-01-01 00:00:00 | 0 |
| 1 | 2018-02-01 00:00:00 | 1 |
| 2 | 2018-03-01 00:00:00 | 1 |
| 3 | 2018-04-01 00:00:00 | 1 |
| 4 | 2018-05-01 00:00:00 | 2 |
| 5 | 2018-06-01 00:00:00 | 3 |
| 6 | 2018-07-01 00:00:00 | 3 |
| 7 | 2018-08-01 00:00:00 | 3 |
| 8 | 2018-09-01 00:00:00 | 4 |
| 9 | 2018-10-01 00:00:00 | 5 |
| 10 | 2018-11-01 00:00:00 | 5 |
| 11 | 2018-12-01 00:00:00 | 5 |
| 12 | 2019-01-01 00:00:00 | 5 |
| 13 | 2019-02-01 00:00:00 | 6 |
| 14 | 2019-03-01 00:00:00 | 7 |
| 15 | 2019-04-01 00:00:00 | 8 |
+----+---------------------+-------+
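Edit: There is an issue at the threshold dates. For example, 2019-03-28 gives 7 with this code but 8 with yours, because pd.cut closes each bin on the right by default, while your function uses a strict <. For date-only values this can be solved by decreasing the thresholds by 1 day. Alternatively, pass right=False to pd.cut so that each bin includes its left edge instead.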
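The boundary behaviour can be checked with a small sketch. It assumes the same thresholds as the question; the names thresholds, sample and expected are illustrative only, and the sample dates (each threshold plus the day before it) are made up for the check.

```python
import numpy as np
import pandas as pd

# Same cut points as above, with far-past/far-future sentinels at the ends.
cuts = pd.to_datetime(['01-01-1970', '01-04-2018', '04-05-2018', '05-31-2018',
                       '08-02-2018', '09-27-2018', '01-03-2019',
                       '02-14-2019', '03-28-2019', '01-01-2050'])
thresholds = cuts[1:-1]

# Illustrative sample: every threshold itself, then the day before each one.
sample = pd.Series(thresholds.append(thresholds - pd.Timedelta(days=1)))

# Default right=True: bins are (left, right], so a date equal to a
# threshold is placed in the earlier period.
codes_right_closed = pd.cut(sample, bins=cuts).cat.codes

# right=False: bins are [left, right), which matches `x < threshold`.
codes_left_closed = pd.cut(sample, bins=cuts, right=False).cat.codes

# Reference result: the number of thresholds that are <= the date, which is
# what the if/elif chain in the question computes.
expected = np.array([(thresholds <= d).sum() for d in sample])

print((codes_left_closed.to_numpy() == expected).all())   # boundary-safe variant
print((codes_right_closed.to_numpy() == expected).all())  # default variant
```

Only the rows that fall exactly on a threshold differ between the two variants, so on data without such dates either call gives the same codes.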
Assuming you put the thresholds in a sorted array of dates:
dates = pd.to_datetime([
    '01-04-2018', '04-05-2018', '05-31-2018',
    '08-02-2018', '09-27-2018', '01-03-2019',
    '02-14-2019', '03-28-2019'
]).values
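You can use searchsorted, which tells you the position at which each compared date would be inserted into dates to keep it sorted. Passing side='right' puts a date that equals a threshold into the later period, which matches the strict < in your function (the default, side='left', would put it into the earlier one):
df.assign(Period=dates.searchsorted(df.ORDER_RECEIVED_DATE, side='right'))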
ORDER_RECEIVED_DATE Period
0 2018-01-01 0
1 2018-02-01 1
2 2018-03-01 1
3 2018-04-01 1
4 2018-05-01 2
5 2018-06-01 3
6 2018-07-01 3
7 2018-08-01 3
8 2018-09-01 4
9 2018-10-01 5
10 2018-11-01 5
11 2018-12-01 5
12 2019-01-01 5
13 2019-02-01 6
14 2019-03-01 7
15 2019-04-01 8
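One detail worth checking is how searchsorted treats a date that is exactly equal to a threshold. The sketch below compares both side settings against a scalar loop that is meant to be equivalent to the if/elif chain in the question. The frame contents and the names thresholds, period_reference, left, right and reference are assumptions made for the check, not part of the original post.

```python
import pandas as pd

thresholds = pd.to_datetime([
    '01-04-2018', '04-05-2018', '05-31-2018',
    '08-02-2018', '09-27-2018', '01-03-2019',
    '02-14-2019', '03-28-2019'
])
dates = thresholds.values  # sorted numpy datetime64 array


def period_reference(x):
    """Scalar reference: index of the first threshold strictly greater than x."""
    for i, threshold in enumerate(thresholds):
        if x < threshold:
            return i
    return len(thresholds)


# Hypothetical data: the 8 threshold dates first, then month starts,
# none of which coincides with a threshold.
df = pd.DataFrame({
    'ORDER_RECEIVED_DATE': thresholds.append(
        pd.date_range('2018-01-01', '2019-04-01', freq='MS'))
})
values = df['ORDER_RECEIVED_DATE'].values

left = dates.searchsorted(values)                 # side='left' is the default
right = dates.searchsorted(values, side='right')
reference = df['ORDER_RECEIVED_DATE'].apply(period_reference).to_numpy()

print((right == reference).all())  # side='right' agrees with the strict `<`
print((left == reference).all())   # side='left' differs on the threshold dates
```

Both variants are a single vectorized binary search over the sorted thresholds, so they avoid the per-row Python call made by apply.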