Forecasting with statsmodels

I have a .csv file containing a 5-year time series of a commodity price at hourly resolution. Based on the historical data, I want to create a forecast of the prices for the 6th year.

I have read a couple of articles online about this type of procedure and based my code largely on the code posted there, since my knowledge of both Python (especially statsmodels) and statistics is limited at best.

Here are the links, for those who are interested:

http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/

http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/

First of all, here is a sample of the .csv file. In this case the data is shown at monthly resolution; it is not real data, just randomly chosen numbers to give an example (I hope one year is enough to develop a forecast for the 2nd year; if not, the full csv file is available):

              Price
2011-01-31    32.21
2011-02-28    28.32
2011-03-31    27.12
2011-04-30    29.56
2011-05-31    31.98
2011-06-30    26.25
2011-07-31    24.75
2011-08-31    25.56
2011-09-30    26.68
2011-10-31    29.12
2011-11-30    33.87
2011-12-31    35.45

My current progress is as follows:

After reading the input file and setting the date column as a datetime index, I used the following script to develop a forecast for the available data:

model = sm.tsa.ARIMA(df['Price'].iloc[1:], order=(1, 0, 0))  
results = model.fit(disp=-1)  
df['Forecast'] = results.fittedvalues  
df[['Price', 'Forecast']].plot(figsize=(16, 12))  

which gives the following output:

[Figure: 5-year time series, hourly resolution data]

Now, as I said, I have no statistics skills and little to no idea how I got to this output (basically, changing the order argument in the first line changes the output), but the 'actual' forecast looks quite good and I would like to extend it for another year (2016).

In order to do that, I create additional rows in the dataframe, as follows:

start = datetime.datetime.strptime("2016-01-01", "%Y-%m-%d")
date_list = pd.date_range('2016-01-01', freq='1D', periods=366)
future = pd.DataFrame(index=date_list, columns= df.columns)
data = pd.concat([df, future])

Finally, when I use the .predict function of statsmodels:

data['Forecast'] = results.predict(start = 1825, end = 2192, dynamic= True)  
data[['Price', 'Forecast']].plot(figsize=(12, 8))

what I get as a forecast is a straight line (see below), which doesn't look like a forecast at all. Moreover, if I extend the range, which currently runs from the 1825th to the 2192nd day (the year 2016), to the whole 6-year timespan, the forecast is a straight line for the entire period (2011-2016).

I have also tried to use the 'statsmodels.tsa.statespace.sarimax.SARIMAX.predict' method, which accounts for seasonal variation (which makes sense in this case), but I get an error saying that 'module' has no attribute 'SARIMAX'. This is a secondary problem, though; I will go into more detail if needed.

[Figure: forecast output]

Somewhere I am losing my grip and I have no idea where. Thanks for reading. Cheers!

It sounds like you are using an older version of statsmodels that does not support SARIMAX. You'll want to install the latest released version, 0.8.0; see http://statsmodels.sourceforge.net/devel/install.html.

I'm using Anaconda and installed via pip.

pip install -U statsmodels

The results class from the SARIMAX model has a number of useful methods, including forecast.

data['Forecast'] = results.forecast(100)

This will use your model to forecast 100 steps into the future.

ARIMA(1,0,0) is a one-period autoregressive model. So it's a model that follows this formula:

$$r_t = \phi_0 + \phi_1 r_{t-1} + a_t$$

What that means is that the value in time period t is equal to some constant (phi_0), plus a coefficient determined by fitting the ARMA model (phi_1) multiplied by the value in the prior period r_(t-1), plus a white noise error term (a_t).

Your model only has a memory of 1 period, so the current prediction is entirely determined by the single value of the prior period. It's not a very complex model; it's not doing anything fancy with all the prior values. It's just taking yesterday's price, multiplying it by some value and adding a constant. You should expect it to quickly go to equilibrium and then stay there forever.
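To see why the long-range forecast flattens out, here is a small sketch that iterates the AR(1) recursion by hand. The values of phi_0, phi_1 and the starting price are hypothetical, picked only for illustration; the noise term is set to its expected value of zero, which is what a point forecast does.

```python
# Hypothetical AR(1) parameters, chosen only for illustration.
phi_0 = 6.0        # constant
phi_1 = 0.8        # coefficient on the previous value
last_price = 45.0  # last observed value

def ar1_forecast(last_value, phi_0, phi_1, steps):
    """Iterate r_t = phi_0 + phi_1 * r_(t-1) with the noise term set to zero."""
    path = []
    value = last_value
    for _ in range(steps):
        value = phi_0 + phi_1 * value
        path.append(value)
    return path

path = ar1_forecast(last_price, phi_0, phi_1, steps=100)

# Long-run level the recursion settles at, valid when |phi_1| < 1.
equilibrium = phi_0 / (1 - phi_1)

print([round(v, 2) for v in path[:5]])
print(round(path[-1], 4), round(equilibrium, 4))
```

Each step shrinks the distance to the equilibrium by a factor of phi_1, so after a few dozen steps the forecast is indistinguishable from a constant. That is the straight line in your plot.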

The reason the forecast in the top picture looks so good is that it is just showing you hundreds of 1-period forecasts, each starting fresh from the actual value of the previous period. It's not showing a long-horizon prediction like you probably think it is.
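The difference between the two kinds of forecast can be sketched without any library. The prices below are the first few values from the sample in the question, and phi_0 and phi_1 are hypothetical parameters rather than fitted ones.

```python
# Sample prices from the question and hypothetical AR(1) parameters.
observed = [32.21, 28.32, 27.12, 29.56, 31.98, 26.25, 24.75, 25.56]
phi_0, phi_1 = 6.0, 0.8

# One-step-ahead: every forecast restarts from the actual previous observation,
# so the forecast line tracks the data closely.
one_step = [phi_0 + phi_1 * prev for prev in observed[:-1]]

# Dynamic: only the first forecast sees real data; each later forecast
# is fed the model's own previous forecast.
dynamic = []
value = observed[0]
for _ in range(len(observed) - 1):
    value = phi_0 + phi_1 * value
    dynamic.append(value)

for actual, a, b in zip(observed[1:], one_step, dynamic):
    print(f"actual={actual:6.2f}  one-step={a:6.2f}  dynamic={b:6.2f}")
```

The one-step column moves up and down with the data, while the dynamic column drifts smoothly toward a single level. The in-sample fitted values behave like the first column; a forecast for a whole future year can only behave like the second.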

Looking at the link you sent:

http://www.johnwittenauer.net/a-simple-time-series-analysis-of-the-sp-500-index/

read the section where he discusses why this model doesn't give you what you want.

"So at first glance it seems like this model is doing pretty well. But although it appears like the forecasts are really close (the lines are almost indistinguishable after all), remember that we used the un-differenced series! The index only fluctuates a small percentage day-to-day relative to the total absolute value. What we really want is to predict the first difference, or the day-to-day moves. We can either re-run the model using the differenced series, or add an "I" term to the ARIMA model (resulting in a (1, 1, 0) model) which should accomplish the same thing. Let's try using the differenced series."

To do what you're trying to do, you'll need to do more research into these models and figure out how to format your data and which model is appropriate. The most important thing is knowing what information you believe is contained in the data you're feeding into the model. What your model is currently trying to do is say, "Today the price is $45. What will the price be tomorrow?" That's it. It doesn't have any information about momentum, volatility, etc. That's not much to go on.
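The differencing idea from the quoted passage can be sketched in plain Python: take first differences, fit an AR(1) to them, forecast the differences, and add them back up to get price levels. This is only an illustration of what a (1, 1, 0) model does. The fit below uses simple least squares rather than the estimator statsmodels uses, and the prices are the sample values from the question.

```python
# Sample monthly prices from the question.
prices = [32.21, 28.32, 27.12, 29.56, 31.98, 26.25,
          24.75, 25.56, 26.68, 29.12, 33.87, 35.45]

# First difference: the period-to-period moves.
diffs = [b - a for a, b in zip(prices, prices[1:])]

# Fit d_t = phi_0 + phi_1 * d_(t-1) by ordinary least squares (an assumption
# made for simplicity; statsmodels estimates the parameters differently).
x, y = diffs[:-1], diffs[1:]
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
phi_1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
phi_0 = mean_y - phi_1 * mean_x

# Forecast the differences, then cumulate them back into price levels.
steps = 6
level, diff = prices[-1], diffs[-1]
forecast = []
for _ in range(steps):
    diff = phi_0 + phi_1 * diff
    level += diff
    forecast.append(level)

print([round(v, 2) for v in forecast])
```

The forecast differences still settle to a constant, so the level forecast becomes a trend line rather than a flat one. Differencing changes what is being predicted, but it does not by itself add seasonal structure.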

Try setting dynamic = False when predicting.
