How to properly set start/end params of statsmodels.tsa.ar_model.AR.predict function
I have a dataframe of project costs from an irregularly spaced time series, and I would like to try applying the statsmodels AR model to it.
This is a sample of the data in its dataframe:
cost
date
2015-07-16 35.98
2015-08-11 25.00
2015-08-11 43.94
2015-08-13 26.25
2015-08-18 15.38
2015-08-24 77.72
2015-09-09 40.00
2015-09-09 20.00
2015-09-09 65.00
2015-09-23 70.50
2015-09-29 59.00
2015-11-03 19.25
2015-11-04 19.97
2015-11-10 26.25
2015-11-12 19.97
2015-11-12 23.97
2015-11-12 21.88
2015-11-23 23.50
2015-11-23 33.75
2015-11-23 22.70
2015-11-23 33.75
2015-11-24 27.95
2015-11-24 27.95
2015-11-24 27.95
...
2017-03-31 21.93
2017-04-06 22.45
2017-04-06 26.85
2017-04-12 60.40
2017-04-12 37.00
2017-04-12 20.00
2017-04-12 66.00
2017-04-12 60.00
2017-04-13 41.95
2017-04-13 25.97
2017-04-13 29.48
2017-04-19 41.00
2017-04-19 58.00
2017-04-19 78.00
2017-04-19 12.00
2017-04-24 51.05
2017-04-26 21.88
2017-04-26 50.05
2017-04-28 21.00
2017-04-28 30.00
I am having a hard time understanding how to use start and end in the predict function.
According to the docs:
start : int, str, or datetime. Zero-indexed observation number at which to start forecasting, ie., the first forecast is start. Can also be a date string to parse or a datetime type.
end : int, str, or datetime. Zero-indexed observation number at which to end forecasting, ie., the first forecast is start. Can also be a date string to parse or a datetime type.
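To make the quoted description concrete, here is a small pandas-only sketch of how a date corresponds to a zero-indexed observation number on a daily index. It does not call statsmodels. The date range mirrors the one used in this question, and the variable names (idx, start_pos, end_pos) are made up for illustration.

```python
import pandas as pd

# Daily index covering the same span as the dataframe in the question.
idx = pd.date_range(start="2015-01-01", end="2017-12-31", freq="D")

# Zero-indexed position of each date within that index.
start_pos = idx.get_loc(pd.Timestamp("2016-01-01"))
end_pos = idx.get_loc(pd.Timestamp("2016-06-01"))

print(start_pos, end_pos)            # positions usable as integer start/end
print(idx[start_pos], idx[end_pos])  # and the dates they map back to
```

This mapping is only unambiguous when every date appears exactly once in the index, which is an assumption of the sketch.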
I create a dataframe with an empty daily time series, add my irregularly spaced time series data to it, and then try to apply the model.
from datetime import datetime
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('data.csv', index_col=1, parse_dates=True)
df = pd.DataFrame(index=pd.date_range(start=datetime(2015, 1, 1), end=datetime(2017, 12, 31), freq='d'))
df = df.join(data)
df.cost.interpolate(inplace=True)
ar_model = sm.tsa.AR(df, missing='drop', freq='D')
ar_res = ar_model.fit(maxlag=9, method='mle', disp=-1)
pred = ar_res.predict(start='2016', end='2016')
The predict function results in this error: pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 605-12-31 00:00:00
If I try to use a more specific date, I get the same type of error:
pred = ar_res.predict(start='2016-01-01', end='2016-06-01')
If I try to use integers, I get a different error:
pred = ar_res.predict(start=0, end=len(data))
Wrong number of items passed 202, placement implies 197
If I actually use a datetime, I get an error that reads: no rule for interpreting end.
I am hitting a wall so hard here that I think there must be something I am missing.
Ultimately, I would like to use the model to get out-of-sample predictions (such as a prediction for next quarter).
This works if you pass a datetime (rather than a date):
from datetime import datetime
...
pred = ar_res.predict(start=datetime(2015, 1, 1), end=datetime(2017,12,31))
In [21]: pred.head(2) # my dummy numbers from data
Out[21]:
2015-01-01 35
2015-01-02 23
Freq: D, dtype: float64
In [22]: pred.tail(2)
Out[22]:
2017-12-30 44
2017-12-31 44
Freq: D, dtype: float64
So I was creating a daily index to satisfy the equally spaced time series requirement, but the index still remained non-unique (comment by @user333700).
I added a groupby function to sum duplicate dates together, and could then run the predict function using datetime objects (answer by @andy-hayden).
df = df.groupby(pd.TimeGrouper(freq='D')).sum()
...
ar_res.predict(start=min(df.index), end=datetime(2018,12,31))
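To illustrate why the index was non-unique and what the groupby step changes, here is a minimal pandas-only sketch. The four rows are taken from the sample above, the names daily, joined and summed are made up for illustration, and pd.Grouper is used in place of pd.TimeGrouper, which as far as I know was deprecated and later removed in newer pandas releases. Passing min_count=1 is my own addition, so that days without any rows stay NaN instead of becoming 0.

```python
import pandas as pd

# A few rows from the sample data, including a repeated date.
data = pd.DataFrame(
    {"cost": [25.00, 43.94, 26.25, 15.38]},
    index=pd.to_datetime(["2015-08-11", "2015-08-11", "2015-08-13", "2015-08-18"]),
)

# Joining onto a unique daily index still repeats 2015-08-11,
# because the join emits one row per matching row in `data`.
daily = pd.DataFrame(index=pd.date_range("2015-08-10", "2015-08-19", freq="D"))
joined = daily.join(data)
print(joined.index.is_unique)

# Sum the costs per calendar day so that each date appears exactly once.
# min_count=1 keeps days with no rows as NaN (assumption: you want to
# interpolate those afterwards rather than treat them as zero cost).
summed = data.groupby(pd.Grouper(freq="D")).sum(min_count=1)
print(summed.index.is_unique)
print(summed)
```

With a unique index, each date corresponds to exactly one observation, which is what the date-based start and end arguments rely on.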
With the predict function providing a result, I am now able to analyze the results and tweak the params to get something useful.