python statsmodels: Help using ARIMA model for time series

ARIMA from statsmodels is giving me inaccurate predictions. I was wondering whether someone could help me understand what's wrong with my code.

This is a sample:

import pandas as pd
import numpy as np
import datetime as dt
from statsmodels.tsa.arima_model import ARIMA

# Setting up a data frame that looks twenty days into the past,
# and has linear data, from approximately 1 through 20
counts = np.arange(1, 21) + 0.2 * (np.random.random(size=(20,)) - 0.5)
start = dt.datetime.strptime("1 Nov 01", "%d %b %y")
daterange = pd.date_range(start, periods=20)
table = {"count": counts, "date": daterange}
data = pd.DataFrame(table)
data.set_index("date", inplace=True)

print(data)

               count
date
2001-11-01   0.998543
2001-11-02   1.914526
2001-11-03   3.057407
2001-11-04   4.044301
2001-11-05   4.952441
2001-11-06   6.002932
2001-11-07   6.930134
2001-11-08   8.011137
2001-11-09   9.040393
2001-11-10  10.097007
2001-11-11  11.063742
2001-11-12  12.051951
2001-11-13  13.062637
2001-11-14  14.086016
2001-11-15  15.096826
2001-11-16  15.944886
2001-11-17  17.027107
2001-11-18  17.930240
2001-11-19  18.984202
2001-11-20  19.971603

The rest of the code sets up the ARIMA model.

# Setting up ARIMA model
order = (2, 1, 2)
model = ARIMA(data, order, freq='D')
model = model.fit()
print(model.predict(1, 20))

2001-11-02    1.006694
2001-11-03    1.056678
2001-11-04    1.116292
2001-11-05    1.049992
2001-11-06    0.869610
2001-11-07    1.016006
2001-11-08    1.110689
2001-11-09    0.945190
2001-11-10    0.882679
2001-11-11    1.139272
2001-11-12    1.094019
2001-11-13    0.918182
2001-11-14    1.027932
2001-11-15    1.041074
2001-11-16    0.898727
2001-11-17    1.078199
2001-11-18    1.027331
2001-11-19    0.978840
2001-11-20    0.943520
2001-11-21    1.040227
Freq: D, dtype: float64

As you can see, the predictions just hover around 1 instead of increasing. What am I doing wrong here?

(On a side note, I can't pass string dates like "2001-11-21" into the predict function for some reason. It would be helpful to know why.)

TL;DR

The way you call predict returns a linear prediction of the differenced endogenous variable, not a prediction of the levels of the original endogenous variable.

You must pass typ='levels' to the predict method to change this behavior:

preds = fit.predict(1, 30, typ='levels')

See the documentation of ARIMAResults.predict for details.
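To make the distinction between "differences" and "levels" concrete, here is a minimal sketch that uses only NumPy and a noise-free version of the question's series (an assumption made for readability). It shows why predictions of the differenced series sit near 1, and how differences relate back to levels.

```python
import numpy as np

# Noise-free stand-in for the question's data: levels 1, 2, ..., 20.
levels = np.arange(1, 21, dtype=float)

# First differences: the series a model with d=1 is actually fitted on.
diffs = np.diff(levels)

# Undo the differencing: start from the first level and accumulate the changes.
rebuilt = np.concatenate(([levels[0]], levels[0] + np.cumsum(diffs)))

print(diffs[:5])    # [1. 1. 1. 1. 1.]
print(rebuilt[:5])  # [1. 2. 3. 4. 5.]
```

A prediction of the differenced series therefore stays close to the slope (about 1 here), while a prediction of the levels keeps increasing.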

Step by step

Dataset

We load the data you provided in your MCVE:

import io
import pandas as pd

raw = io.StringIO("""date        count
2001-11-01   0.998543
2001-11-02   1.914526
2001-11-03   3.057407
2001-11-04   4.044301
2001-11-05   4.952441
2001-11-06   6.002932
2001-11-07   6.930134
2001-11-08   8.011137
2001-11-09   9.040393
2001-11-10  10.097007
2001-11-11  11.063742
2001-11-12  12.051951
2001-11-13  13.062637
2001-11-14  14.086016
2001-11-15  15.096826
2001-11-16  15.944886
2001-11-17  17.027107
2001-11-18  17.930240
2001-11-19  18.984202
2001-11-20  19.971603""")

data = pd.read_fwf(raw, parse_dates=['date'], index_col='date')

As we might expect, the data are autocorrelated:

from pandas.plotting import autocorrelation_plot
autocorrelation_plot(data)

[Figure: autocorrelation plot of the data]
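The visual impression can be backed up with a number. The sketch below uses pandas' Series.autocorr, which computes the Pearson correlation between a series and a lagged copy of itself; this is not the same estimator that autocorrelation_plot draws, so treat it as a rough cross-check only. The values are copied from the question, and the variable names are illustrative.

```python
import pandas as pd

# Values copied from the question's sample data.
counts = [0.998543, 1.914526, 3.057407, 4.044301, 4.952441,
          6.002932, 6.930134, 8.011137, 9.040393, 10.097007,
          11.063742, 12.051951, 13.062637, 14.086016, 15.096826,
          15.944886, 17.027107, 17.930240, 18.984202, 19.971603]
series = pd.Series(counts, index=pd.date_range("2001-11-01", periods=20, freq="D"))

# Lag-1 correlation of the raw levels: close to 1 because of the trend.
lag1_levels = series.autocorr(lag=1)

# Lag-1 correlation after first differencing: the trend is gone.
lag1_diffs = series.diff().dropna().autocorr(lag=1)

print(lag1_levels, lag1_diffs)
```

The strong lag-1 correlation of the levels comes almost entirely from the linear trend, which is why the model is fitted with one order of differencing (d=1).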

Model & Training

We create an ARIMA model object for a given order (p, d, q) and train it on our data using the fit method:

# Legacy ARIMA class (available in statsmodels 0.9.0); this module was removed in statsmodels 0.13
from statsmodels.tsa.arima_model import ARIMA

order = (2, 1, 2)
model = ARIMA(data, order, freq='D')
fit = model.fit()

It returns an ARIMAResults object, which is the object of interest. We can check the quality of our model:

fit.summary()

                            ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.count   No. Observations:                   19
Model:                 ARIMA(2, 1, 2)   Log Likelihood                  25.395
Method:                       css-mle   S.D. of innovations              0.059
Date:                Fri, 18 Jan 2019   AIC                            -38.790
Time:                        07:54:36   BIC                            -33.123
Sample:                    11-02-2001   HQIC                           -37.831
                         - 11-20-2001                                         
==============================================================================
                  coef    std err          z      P>|z|      [0.025     0.975]
------------------------------------------------------------------------------
const           1.0001      0.014     73.731      0.000       0.973      1.027
ar.L1.D.count  -0.3971      0.295     -1.346      0.200      -0.975      0.181
ar.L2.D.count  -0.6571      0.230     -2.851      0.013      -1.109     -0.205
ma.L1.D.count   0.0892      0.208      0.429      0.674      -0.318      0.496
ma.L2.D.count   1.0000      0.640      1.563      0.140      -0.254      2.254
                                    Roots                                    
==============================================================================
                   Real          Imaginary           Modulus         Frequency
------------------------------------------------------------------------------
AR.1            -0.3022           -1.1961j            1.2336           -0.2894
AR.2            -0.3022           +1.1961j            1.2336            0.2894
MA.1            -0.0446           -0.9990j            1.0000           -0.2571
MA.2            -0.0446           +0.9990j            1.0000            0.2571
------------------------------------------------------------------------------

And we can get a rough idea of how the residuals are distributed:

residuals = pd.DataFrame(fit.resid, columns=['residuals'])
residuals.plot(kind='kde')

[Figure: kernel density estimate of the residuals]

Prediction

If we are satisfied with our model, we can predict some in-sample or out-of-sample data.

This can be done with the predict method, which by default returns the differenced endogenous variable, not the endogenous variable itself. To change this behavior, we must specify typ='levels':

preds = fit.predict(1, 30, typ='levels')

Our predictions are then on the same scale (levels) as our training data:

[Figure: predictions plotted against the training data]

Additionally, if we also want confidence intervals, we can use the forecast method.
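As a rough illustration of what a levels prediction means for d=1, the sketch below rebuilds one-step-ahead level predictions by hand from the numbers printed in the question: each predicted level is the previous observed level plus the predicted change. This is a conceptual sketch for non-dynamic prediction with a single difference, not a copy of the statsmodels internals; the variable names are illustrative.

```python
import numpy as np

# Observed levels from the question (2001-11-01 .. 2001-11-20).
observed = np.array([0.998543, 1.914526, 3.057407, 4.044301, 4.952441,
                     6.002932, 6.930134, 8.011137, 9.040393, 10.097007,
                     11.063742, 12.051951, 13.062637, 14.086016, 15.096826,
                     15.944886, 17.027107, 17.930240, 18.984202, 19.971603])

# Differenced predictions printed in the question (2001-11-02 .. 2001-11-21).
pred_diffs = np.array([1.006694, 1.056678, 1.116292, 1.049992, 0.869610,
                       1.016006, 1.110689, 0.945190, 0.882679, 1.139272,
                       1.094019, 0.918182, 1.027932, 1.041074, 0.898727,
                       1.078199, 1.027331, 0.978840, 0.943520, 1.040227])

# Predicted level for day t+1 = observed level on day t + predicted change.
level_preds = observed + pred_diffs

# In-sample errors: predictions for 11-02 .. 11-20 against what was observed.
errors = level_preds[:-1] - observed[1:]

print(level_preds[:3])
print(np.abs(errors).max())
```

The last element of level_preds corresponds to 2001-11-21, the first out-of-sample day, and it continues the upward trend instead of staying near 1.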

String Argument

It is also possible to pass strings (always use the ISO-8601 format if you want to avoid trouble) or datetime objects to predict:

preds = fit.predict("2001-11-02", "2001-12-15", typ='levels')

This works as expected on statsmodels 0.9.0:

import statsmodels as sm
sm.__version__ # '0.9.0'
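For intuition about how date strings relate to the integer positions used earlier (predict(1, 30)), here is a small pandas-only sketch. It assumes a regular daily index like the one in the question and is only meant to show the correspondence; statsmodels has its own, more general date handling.

```python
import pandas as pd

# Same daily index as the training data.
index = pd.date_range("2001-11-01", periods=20, freq="D")

# In-sample date: look up its position in the index.
start_pos = index.get_loc(pd.Timestamp("2001-11-02"))

# Out-of-sample date: count the days since the first observation.
end_pos = (pd.Timestamp("2001-12-15") - index[0]).days

print(start_pos, end_pos)  # 1 44
```

Under that assumption, predicting from "2001-11-02" to "2001-12-15" covers the same range as integer positions 1 through 44.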
