Time-series - data splitting and model evaluation
I've been trying to use machine learning to make predictions from time-series data. One Stack Overflow question (createTimeSlices function in CARET package in R) has an example of using createTimeSlices for cross-validation during model training and parameter tuning:
library(caret)
library(ggplot2)
library(pls)
data(economics)
myTimeControl <- trainControl(method = "timeslice",
initialWindow = 36,
horizon = 12,
fixedWindow = TRUE)
plsFitTime <- train(unemploy ~ pce + pop + psavert,
data = economics,
method = "pls",
preProc = c("center", "scale"),
trControl = myTimeControl)
My understanding is:
Because my data is a time series, I suppose I cannot use bootstrapping to split the data into training and test sets. So, my questions are: Am I right? And if so, how do I use createTimeSlices for model evaluation?
Note that the code in the original question you posted already takes care of the time slicing, so you don't have to create the time slices by hand.
However, here is how to use createTimeSlices to split the data and then use the slices for training and testing a model.
Step 0: Setting up the data and trainControl (from your question):
library(caret)
library(ggplot2)
library(pls)
data(economics)
Step 1: Creating the timeSlices for the index of the data:
timeSlices <- createTimeSlices(1:nrow(economics),
initialWindow = 36, horizon = 12, fixedWindow = TRUE)
This creates a list of training and testing timeSlices.
> str(timeSlices,max.level = 1)
## List of 2
## $ train:List of 431
## .. [list output truncated]
## $ test :List of 431
## .. [list output truncated]
For ease of understanding, I am saving them in separate variables:
trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]
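To see what the slices contain, it can help to run createTimeSlices on a toy index first. The sketch below assumes a nine-point series (not the economics data) and a shorter window and horizon. Each training slice is a window of consecutive indices, and each test slice is the block that immediately follows it.

```r
library(caret)

# Toy series of 9 time points (an assumption for illustration only)
toy <- createTimeSlices(1:9, initialWindow = 5, horizon = 3, fixedWindow = TRUE)

# Drop the slice names so the index vectors are easy to compare
toyTrain <- unname(toy$train)
toyTest  <- unname(toy$test)

toyTrain[[1]]  # 1 2 3 4 5  -> first training window
toyTest[[1]]   # 6 7 8      -> the 3 points right after it
toyTrain[[2]]  # 2 3 4 5 6  -> window slides forward by one
toyTest[[2]]   # 7 8 9
```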
Step 2: Training on the first of the trainSlices:
plsFitTime <- train(unemploy ~ pce + pop + psavert,
data = economics[trainSlices[[1]],],
method = "pls",
preProc = c("center", "scale"))
Step 3: Testing on the first of the testSlices:
pred <- predict(plsFitTime,economics[testSlices[[1]],])
Step 4: Plotting:
true <- economics$unemploy[testSlices[[1]]]
plot(true, col = "red", ylab = "true (red) , pred (blue)", ylim = range(c(pred,true)))
points(pred, col = "blue")
You can then do this for all the slices:
for(i in 1:length(trainSlices)){
plsFitTime <- train(unemploy ~ pce + pop + psavert,
data = economics[trainSlices[[i]],],
method = "pls",
preProc = c("center", "scale"))
pred <- predict(plsFitTime,economics[testSlices[[i]],])
true <- economics$unemploy[testSlices[[i]]]
plot(true, col = "red", ylab = "true (red) , pred (blue)",
main = i, ylim = range(c(pred,true)))
points(pred, col = "blue")
}
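Plots are hard to compare across hundreds of slices, so for model evaluation you would usually also record an error metric per slice. The following is a minimal sketch and not part of the original answer. It uses a synthetic data frame and a plain lm fit (both assumptions, chosen so the example runs quickly without pls), and the names simData and sliceRmse are made up for illustration.

```r
library(caret)

set.seed(1)
# Synthetic stand-in for a time-ordered data set (assumption, not economics)
n <- 60
simData <- data.frame(x = seq_len(n) + rnorm(n))
simData$y <- 2 * simData$x + rnorm(n)

slices <- createTimeSlices(seq_len(n), initialWindow = 36, horizon = 12,
                           fixedWindow = TRUE)

# Fit on each training slice, predict the following test slice, store the RMSE
sliceRmse <- vapply(seq_along(slices$train), function(i) {
  fit  <- lm(y ~ x, data = simData[slices$train[[i]], ])
  pred <- predict(fit, newdata = simData[slices$test[[i]], ])
  sqrt(mean((simData$y[slices$test[[i]]] - pred)^2))
}, numeric(1))

length(sliceRmse)  # one value per slice
mean(sliceRmse)    # average out-of-sample RMSE over all slices
```

Swapping the lm call for the train(...) call used in the loop should give the same kind of per-slice summary for the PLS model.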
As mentioned earlier, this sort of time slicing is done in one step by the code in your original question:
> myTimeControl <- trainControl(method = "timeslice",
+ initialWindow = 36,
+ horizon = 12,
+ fixedWindow = TRUE)
>
> plsFitTime <- train(unemploy ~ pce + pop + psavert,
+ data = economics,
+ method = "pls",
+ preProc = c("center", "scale"),
+ trControl = myTimeControl)
> plsFitTime
Partial Least Squares
478 samples
5 predictors
Pre-processing: centered, scaled
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)
Summary of sample sizes: 36, 36, 36, 36, 36, 36, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared RMSE SD Rsquared SD
1 1080 0.443 796 0.297
2 1090 0.43 845 0.295
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 1.
Hope this helps!!
Shambho's answer provides a decent example of how to use the caret package with time slices; however, it can be misleading in terms of modelling technique. So, in order not to misguide future readers who want to use the caret package for predictive modelling on time series (and here I do not mean autoregressive models), I want to highlight a few things.
The problem with time-series data is that look-ahead bias creeps in easily if one is not careful. In this case, the economics data set aligns the data at their economic reporting dates and not at their release dates, which is never the case in real-life applications (economic data points have different time stamps). Unemployment data may be two months behind the other indicators in terms of release date, which would then introduce a model bias in Shambho's example.
Next, this example is only descriptive statistics and not predictive (forecasting), because the data we want to forecast (unemploy) is not lagged correctly. It merely trains a model to best explain the variation in unemployment (which in this case is also a non-stationary time series, creating all sorts of issues in the modelling process) based on predictor variables at the same economic report dates.
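One way to turn the setup into an actual forecasting problem is to pair each target value with predictor values from an earlier date. The sketch below is an illustration under assumptions: a small made-up data frame, a hypothetical helper lagBy, and a forecast lead h of 2 periods. It does not deal with release-date alignment, which needs the real publication calendar.

```r
# Hypothetical helper: shift a vector back by h periods, padding with NA
lagBy <- function(x, h) c(rep(NA, h), head(x, -h))

# Made-up monthly data (assumption for illustration)
df <- data.frame(t = 1:8,
                 target = c(10, 11, 13, 12, 14, 15, 17, 16),
                 predictor = c(1, 2, 3, 4, 5, 6, 7, 8))

h <- 2  # forecast lead: predict the target 2 periods ahead
df$predictorLag <- lagBy(df$predictor, h)

# Rows without a lagged predictor cannot be used for training
dfModel <- df[!is.na(df$predictorLag), ]

# Each remaining row now pairs the target at time t with the predictor from t - h
dfModel[, c("t", "target", "predictorLag")]
```

A model fitted on dfModel only sees predictor values that were already available h periods before the target date.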
Lastly, the 12-month horizon in this example is not true multi-period forecasting in the way Hyndman does it in his examples.
Hyndman on cross-validation for time-series
Actually, you can!
First, let me give you a scholarly article on the topic.
In R:
Using the package caret, createResample can be used to make simple bootstrap samples and createFolds can be used to generate balanced cross-validation groupings from a set of data. So you'll probably want to use createResample. Here's an example of its usage:
library(caret)
data(oil)
createDataPartition(oilType, 2)
x <- rgamma(50, 3, .5)
inA <- createDataPartition(x, list = FALSE)
plot(density(x[inA]))
rug(x[inA])
points(density(x[-inA]), type = "l", col = 4)
rug(x[-inA], col = 4)
createResample(oilType, 2)
createFolds(oilType, 10)
createFolds(oilType, 5, FALSE)
createFolds(rnorm(21))
createTimeSlices(1:9, 5, 1, fixedWindow = FALSE)
createTimeSlices(1:9, 5, 1, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = FALSE)
The values you see in the createResample function are the data and the number of partitions to create, in this case 2. You can additionally specify whether the results should be stored as a list, with list = TRUE or list = FALSE.
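As a quick check of what createResample returns, the sketch below draws two bootstrap samples from a short made-up sequence (the vector y and the seed are assumptions). The indices are sampled with replacement, so some observations repeat and others are left out.

```r
library(caret)

set.seed(42)
y <- 1:20  # made-up series of 20 observations

# Two bootstrap resamples, returned as a list of index vectors
boot <- createResample(y, times = 2, list = TRUE)

length(boot)                  # 2 resamples
lengths(boot)                 # each has as many indices as y
anyDuplicated(boot[[1]]) > 0  # TRUE when some observations were drawn more than once
setdiff(y, boot[[1]])         # observations left out of the first resample
```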
Additionally, caret contains a function called createTimeSlices that can create the indices for this type of splitting.
The three parameters for this type of splitting are:
initialWindow: the initial number of consecutive values in each training set sample
horizon: the number of consecutive values in the test set sample
fixedWindow: a logical; if FALSE, the training set always starts at the first sample and the training set size varies over data splits
Usage:
createDataPartition(y,
times = 1,
p = 0.5,
list = TRUE,
groups = min(5, length(y)))
createResample(y, times = 10, list = TRUE)
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
createMultiFolds(y, k = 10, times = 5)
createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE)
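To make the effect of fixedWindow concrete, here is a small sketch on a nine-point toy index (the series and the variable names fixed and growing are assumptions for illustration). With fixedWindow = TRUE the training window slides and keeps its size. With FALSE it stays anchored at the first sample and grows.

```r
library(caret)

fixed   <- createTimeSlices(1:9, initialWindow = 5, horizon = 1, fixedWindow = TRUE)
growing <- createTimeSlices(1:9, initialWindow = 5, horizon = 1, fixedWindow = FALSE)

# Training-set size per slice
lengths(fixed$train)    # stays at 5 for every slice
lengths(growing$train)  # grows by one each time: 5, 6, 7, 8

# Both variants test on the same held-out points
unlist(fixed$test, use.names = FALSE)
unlist(growing$test, use.names = FALSE)
```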
Sources:
http://caret.r-forge.r-project.org/splitting.html
http://eranraviv.com/blog/bootstrapping-time-series-r-code/
http://rgm3.lab.nig.ac.jp/RGM/R_rdfile?f=caret/man/createDataPartition.Rd&d=R_CC