Querying Python Pandas DataFrame with a Datetime index or column
So, I'm new to the pandas package. I was backtesting a strategy on ETFs, which requires a lot of queries on a pandas DataFrame.
Say I have these two DataFrames, df and df1. The only difference is that df has a datetime index, while df1 has the timestamp as a column and an integer index.
In[104]: df.head()
Out[104]:
high low open close volume openInterest
2007-04-24 09:31:00 148.28 148.12 148.23 148.15 2304400 341400
2007-04-24 09:32:00 148.21 148.14 148.14 148.19 2753500 449100
2007-04-24 09:33:00 148.24 148.13 148.18 148.14 2863400 109900
2007-04-24 09:34:00 148.18 148.12 148.13 148.16 3118287 254887
2007-04-24 09:35:00 148.17 148.14 148.16 148.16 3202112 83825
In[105]: df1.head()
Out[105]:
dates high low open close volume openInterest
0 2007-04-24 09:31:00 148.28 148.12 148.23 148.15 2304400 341400
1 2007-04-24 09:32:00 148.21 148.14 148.14 148.19 2753500 449100
2 2007-04-24 09:33:00 148.24 148.13 148.18 148.14 2863400 109900
3 2007-04-24 09:34:00 148.18 148.12 148.13 148.16 3118287 254887
4 2007-04-24 09:35:00 148.17 148.14 148.16 148.16 3202112 83825
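For readers who want to reproduce this setup, here is a minimal sketch of two frames with the same layout. The data is synthetic (random values on a continuous one-minute grid, not the ETF data shown above), and the column subset and the `rename_axis`/`reset_index` route to df1 are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute bars (random values, not the ETF data shown above).
idx = pd.date_range("2015-11-16", periods=3 * 1440, freq="min")
rng = np.random.default_rng(0)
close = 148 + rng.normal(0, 0.05, len(idx)).cumsum()

# df: timestamps live in the index (a DatetimeIndex).
df = pd.DataFrame(
    {"open": close, "close": close, "volume": rng.integers(1_000, 10_000, len(idx))},
    index=idx,
)

# df1: same rows, but the timestamps are moved into a 'dates' column
# and replaced by a default integer index.
df1 = df.rename_axis("dates").reset_index()

print(type(df.index).__name__)  # index type of df
print(df1.dtypes["dates"])      # dtype of the timestamp column in df1
```

Going the other way, `df1.set_index("dates")` should give back a frame indexed like df.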
So I tested the query speed a little:
In[100]: %timeit df1[(df1['dates'] >= '2015-11-17') & (df1['dates'] < '2015-11-18')]
%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
100 loops, best of 3: 4.67 ms per loop
100 loops, best of 3: 3.14 ms per loop
1 loop, best of 3: 259 ms per loop
To my surprise, using the logic built into pandas is actually the slowest:
df.loc['2015-11-17']
Does anyone know why that is? And are there any documents or blogs about the most efficient ways to query a pandas DataFrame?
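The comparison in the question can be rerun outside IPython with the standard `timeit` module. The sketch below uses synthetic minute data and hypothetical helper names (`by_column`, `by_index_mask`, `by_partial_string`), so the absolute numbers will not match the ones quoted above. It also checks that the three queries select the same number of rows.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic minute data (an assumption; timings on real data will differ).
idx = pd.date_range("2015-11-01", periods=30 * 1440, freq="min")
df = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)
df1 = df.rename_axis("dates").reset_index()


def by_column():
    # Boolean mask on the timestamp column of df1.
    return df1[(df1["dates"] >= "2015-11-17") & (df1["dates"] < "2015-11-18")]


def by_index_mask():
    # Boolean mask on the DatetimeIndex of df.
    return df.loc[(df.index >= "2015-11-17") & (df.index < "2015-11-18")]


def by_partial_string():
    # Partial string indexing: every row that falls on that calendar day.
    return df.loc["2015-11-17"]


queries = (by_column, by_index_mask, by_partial_string)

# Row counts, to confirm the three queries agree before timing them.
counts = [len(query()) for query in queries]
print(counts)

for query in queries:
    seconds = timeit.timeit(query, number=20) / 20
    print(f"{query.__name__}: {seconds * 1e3:.3f} ms per call")
```

Results depend heavily on the pandas version and on the data, so a run like this is only a way to check the behaviour on your own machine.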
If I were you, I would use the simpler method:
df['2015-11-17']
In my opinion this is more 'pandas logic' than using .loc[] for a single date. I am guessing it is also faster.
Testing on a one-minute OHLC DataFrame:
%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
%timeit df['2015-11-17']
100 loops, best of 3: 13.8 ms per loop
1 loop, best of 3: 1.39 s per loop
1000 loops, best of 3: 486 us per loop
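A caveat for anyone trying this on a newer pandas: as far as I know, looking up rows with a bare date string such as `df['2015-11-17']` was deprecated in later releases in favour of `.loc`, so on a recent version it may raise a KeyError. The sketch below shows two `.loc` forms that select a single day, using synthetic data; the `is_monotonic_increasing` check is my own suggestion for something to inspect before timing, not something taken from the thread.

```python
import numpy as np
import pandas as pd

# Synthetic minute data on a sorted DatetimeIndex (an assumption).
idx = pd.date_range("2015-11-16", periods=3 * 1440, freq="min")
df = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)

# Label-based selection of one calendar day through .loc.
day_loc = df.loc["2015-11-17"]

# Slice form: with partial date strings, both endpoints cover the whole day.
day_slice = df.loc["2015-11-17":"2015-11-17"]

print(len(day_loc), len(day_slice))

# Whether the index is sorted. Whether sortedness explains the timings
# above is an assumption, not something measured here.
print(df.index.is_monotonic_increasing)
```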