
Efficient way to process pandas DataFrame timeseries with Numba

I have a DataFrame with 1,500,000 rows. It's one-minute level stock market data (Open, High, Low, Close, Volume) that I bought from QuantQuote.com. I'm trying to run some home-made backtests of stock market trading strategies. Straight Python code to process the transactions is too slow, and I wanted to try using Numba to speed things up. The trouble is that Numba doesn't seem to work with pandas functions.

Google searches uncover a surprising lack of information about using Numba with pandas, which makes me wonder if I'm making a mistake by considering it.

My setup is Numba 0.13.0-1, Pandas 0.13.1-1, Windows 7, MS VS2013 with PTVS, Python 2.7, Enthought Canopy.

My existing Python+Pandas inner loop has the following general structure:

  • Compute "indicator" columns (with pd.ewma, pd.rolling_max, pd.rolling_min, etc.)
  • Compute "event" columns for predetermined events such as moving average crosses, new highs, etc.

I then use DataFrame.iterrows to process the DataFrame.

I've tried various optimizations, but it's still not as fast as I would like, and the optimizations are causing bugs.

I want to use Numba to process the rows. Are there preferred methods of approaching this?

Because my DataFrame is really just a rectangle of floats, I was considering using something like DataFrame.values to get access to the data and then writing a series of functions that use Numba to access the rows. But that removes all the timestamps, and I don't think it is a reversible operation. I'm also not sure whether the values matrix that I get from DataFrame.values is guaranteed to not be a copy of the data.

Any help is greatly appreciated.

Numba is a NumPy-aware just-in-time compiler. You can pass NumPy arrays as parameters to your Numba-compiled functions, but not Pandas series.

Your only option, still as of 2017-06-27, is to use the Pandas series values, which are actually NumPy arrays.
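As a minimal sketch of that pattern: the function name `running_max` and the sample prices below are made up for illustration, and the fallback decorator is only an assumption added so the snippet still runs where Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for illustration: run as plain Python if Numba is missing.
    def njit(func):
        return func


@njit
def running_max(prices):
    # Explicit loop over a NumPy array; Numba compiles it to machine code.
    n = prices.shape[0]
    out = np.empty(n, dtype=np.float64)
    highest = prices[0]
    for i in range(n):
        if prices[i] > highest:
            highest = prices[i]
        out[i] = highest
    return out


close = pd.Series([10.0, 10.5, 10.2, 11.0, 10.8])
# Pass the underlying NumPy array, not the Series itself.
result = running_max(close.values)
print(result)
```

The Series never enters the compiled function; only its array does, so the loop body has to stick to NumPy indexing and scalars.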

Also, you ask if the values are "guaranteed to not be a copy of the data". For a DataFrame with a single dtype, such as your rectangle of floats, they are not a copy (a DataFrame with mixed dtypes does return a copy). You can verify that:

import pandas


df = pandas.DataFrame([0, 1, 2, 3])
df.values[2] = 8
print(df)  # Should show you the value `8`
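Another way to check this without modifying the data is `numpy.shares_memory`. The sketch below uses made-up column names and values. The mixed-dtype frame is included only as a contrast, since pandas has to build a new array when the columns differ in dtype.

```python
import numpy as np
import pandas as pd

# Single dtype: every column is float64.
floats = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["open", "close"])
# Mixed dtypes: one float64 column and one int64 column.
mixed = pd.DataFrame({"close": [1.0, 2.0], "volume": [10, 20]})

# If two calls to .values share memory, they are views onto the same
# underlying buffer rather than freshly built copies.
single_is_view = np.shares_memory(floats.values, floats.values)
mixed_is_view = np.shares_memory(mixed.values, mixed.values)
print(single_is_view, mixed_is_view)
```

If the second flag comes out false for your real data, selecting only the float columns before taking `.values` should avoid the copy.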

In my opinion, Numba is a great (if not the best) approach to processing market data if you want to stick to Python only. If you want to see great performance gains, make sure to use @numba.jit(nopython=True) (note that this will not allow you to use dictionaries and other Python types inside the JIT-compiled functions, but will make the code run much faster).

Note that some of the indicators you are working with may already have an efficient implementation in Pandas, so consider pre-computing them with Pandas and then passing the values (the NumPy arrays) to your Numba backtesting function.
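A rough sketch of that split, under several assumptions: the `backtest` function and its long-only crossover rule are hypothetical, the prices are invented, `Series.ewm(...).mean()` stands in for the older `pd.ewma`, and the fallback decorator only exists so the snippet runs without Numba. Because the arrays keep the row order of the DataFrame, the result can be assigned back as a column and the timestamps in the index are not lost.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for illustration: run as plain Python if Numba is missing.
    def njit(func):
        return func


@njit
def backtest(close, fast, slow):
    # Hypothetical rule: hold one share while the fast average is above
    # the slow one, and accumulate the price change while holding.
    n = close.shape[0]
    position = np.zeros(n)
    pnl = 0.0
    for i in range(1, n):
        if position[i - 1] == 1.0:
            pnl += close[i] - close[i - 1]
        position[i] = 1.0 if fast[i] > slow[i] else 0.0
    return position, pnl


index = pd.date_range("2014-01-02 09:30", periods=6, freq="min")
df = pd.DataFrame({"close": [10.0, 10.2, 10.4, 10.3, 10.1, 10.5]}, index=index)

# Indicators are pre-computed with pandas.
fast = df["close"].ewm(span=2).mean()
slow = df["close"].ewm(span=4).mean()

# Only the NumPy arrays cross into the compiled function.
position, pnl = backtest(df["close"].values, fast.values, slow.values)

# Row order is unchanged, so the result lines up with the original index.
df["position"] = position
print(df)
print(pnl)
```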
