Time series - data splitting and model evaluation

Question

I've been trying to use machine learning to make predictions from time-series data. One Stack Overflow question (createTimeSlices function in CARET package in R) gives an example of using createTimeSlices for cross-validation during model training and parameter tuning:

    library(caret)
    library(ggplot2)
    library(pls)
    data(economics)
    myTimeControl <- trainControl(method = "timeslice",
                                  initialWindow = 36,
                                  horizon = 12,
                                  fixedWindow = TRUE)

    plsFitTime <- train(unemploy ~ pce + pop + psavert,
                        data = economics,
                        method = "pls",
                        preProc = c("center", "scale"),
                        trControl = myTimeControl)

My understanding is:

I need to split my data into a training set and a test set.
Use the training set for parameter tuning.
Evaluate the resulting model on the test set (using R2, RMSE, etc.)

Because my data is a time series, I suppose that I cannot use bootstrapping to split the data into training and test sets. So, my questions are: Am I right? And if so, how do I use createTimeSlices for model evaluation?

Answer 1

Note that the code in your question already takes care of the time slicing, so you don't have to create the time slices by hand.

However, here is how to use createTimeSlices to split the data and then use the slices for training and testing a model.

Step 0: Setting up the data and trainControl (from your question):

library(caret)
library(ggplot2)
library(pls)

data(economics)

Step 1: Creating the timeSlices for the index of the data:

timeSlices <- createTimeSlices(1:nrow(economics), 
                   initialWindow = 36, horizon = 12, fixedWindow = TRUE)

This creates a list of training and testing timeSlices.

> str(timeSlices,max.level = 1)
## List of 2
## $ train:List of 431
##   .. [list output truncated]
## $ test :List of 431
##   .. [list output truncated]

For ease of understanding, I am saving them in separate variables:

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]

Step 2: Training on the first of the trainSlices:

plsFitTime <- train(unemploy ~ pce + pop + psavert,
                    data = economics[trainSlices[[1]],],
                    method = "pls",
                    preProc = c("center", "scale"))

Step 3: Testing on the first of the testSlices:

pred <- predict(plsFitTime,economics[testSlices[[1]],])

Step 4: Plotting:

true <- economics$unemploy[testSlices[[1]]]

plot(true, col = "red", ylab = "true (red) , pred (blue)", ylim = range(c(pred,true)))
points(pred, col = "blue")

You can then do this for all the slices:

# Repeat the fit/predict/plot cycle for every slice
for (i in seq_along(trainSlices)) {
  # Fit on the i-th training window
  plsFitTime <- train(unemploy ~ pce + pop + psavert,
                      data = economics[trainSlices[[i]], ],
                      method = "pls",
                      preProc = c("center", "scale"))
  # Predict the i-th test window
  pred <- predict(plsFitTime, economics[testSlices[[i]], ])

  true <- economics$unemploy[testSlices[[i]]]
  plot(true, col = "red", ylab = "true (red) , pred (blue)",
       main = i, ylim = range(c(pred, true)))
  points(pred, col = "blue")
}
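Since the question asks about RMSE and R2 rather than plots, the per-slice loop can also collect numeric metrics. The following is a minimal sketch, not part of the original answer. It assumes that only every 12th slice is evaluated (to keep the run short) and that ncomp is fixed at 1 with inner resampling switched off; both choices are illustrative. It uses caret's postResample to compute the metrics.

```r
library(caret)
library(ggplot2)  # provides the economics data set
library(pls)

data(economics)
economics <- as.data.frame(economics)

timeSlices <- createTimeSlices(seq_len(nrow(economics)),
                               initialWindow = 36, horizon = 12,
                               fixedWindow = TRUE)
trainSlices <- timeSlices$train
testSlices  <- timeSlices$test

# Assumption: evaluate every 12th slice only, to keep the run short.
sliceIds <- seq(1, length(trainSlices), by = 12)

# Assumption: fix ncomp = 1 and skip inner resampling,
# so each fit is a single PLS model.
noResampling <- trainControl(method = "none")

metrics <- t(sapply(sliceIds, function(i) {
  fit <- train(unemploy ~ pce + pop + psavert,
               data = economics[trainSlices[[i]], ],
               method = "pls",
               preProc = c("center", "scale"),
               tuneGrid = data.frame(ncomp = 1),
               trControl = noResampling)
  pred <- predict(fit, economics[testSlices[[i]], ])
  # Returns a named vector with RMSE, Rsquared and MAE
  postResample(pred = pred, obs = economics$unemploy[testSlices[[i]]])
}))

# Average out-of-sample performance across the evaluated slices
colMeans(metrics)
```

Each row of metrics corresponds to one train/test slice, so the spread across rows gives a rough idea of how stable the model is over time.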

As mentioned earlier, this sort of time slicing is done by the code in your question in one step:

> myTimeControl <- trainControl(method = "timeslice",
+                               initialWindow = 36,
+                               horizon = 12,
+                               fixedWindow = TRUE)
> 
> plsFitTime <- train(unemploy ~ pce + pop + psavert,
+                     data = economics,
+                     method = "pls",
+                     preProc = c("center", "scale"),
+                     trControl = myTimeControl)
> plsFitTime
Partial Least Squares 

478 samples
  5 predictors

Pre-processing: centered, scaled 
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window) 

Summary of sample sizes: 36, 36, 36, 36, 36, 36, ... 

Resampling results across tuning parameters:

  ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
  1      1080  0.443     796      0.297      
  2      1090  0.43      845      0.295      

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 1.
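The resampled RMSE above is used for tuning. To also get the separate test-set evaluation described in the question, one option is to hold out the most recent observations before calling train. The sketch below is an illustration only: the 48-month holdout and skip = 11 (which makes the tuning test windows non-overlapping and speeds things up) are assumptions, not values from the question.

```r
library(caret)
library(ggplot2)  # provides the economics data set
library(pls)

data(economics)
economics <- as.data.frame(economics)

# Assumption: keep the last 48 months as the untouched test period.
n <- nrow(economics)
testIdx  <- (n - 47):n
trainIdx <- seq_len(n - 48)

tuneControl <- trainControl(method = "timeslice",
                            initialWindow = 36,
                            horizon = 12,
                            fixedWindow = TRUE,
                            skip = 11)  # assumption: fewer, non-overlapping windows

# Tune ncomp with rolling-origin resampling on the earlier data only
fit <- train(unemploy ~ pce + pop + psavert,
             data = economics[trainIdx, ],
             method = "pls",
             preProc = c("center", "scale"),
             trControl = tuneControl)

# Evaluate once on the held-out, most recent period
testPred <- predict(fit, economics[testIdx, ])
postResample(pred = testPred, obs = economics$unemploy[testIdx])
```

Note that train refits the final model on all rows passed in data, so the model evaluated here has seen the whole training period, not just a 36-month window.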

Hope this helps!!

Answer 2

Shambho's answer provides a decent example of how to use the caret package with time slices; however, it can be misleading in terms of modelling technique. So, in order not to mislead future readers who want to use the caret package for predictive modelling on time series (and here I do not mean autoregressive models), I want to highlight a few things.

The problem with time-series data is that look-ahead bias is easy to introduce if one is not careful. In this case, the economics data set has its data aligned at the economic reporting dates and not at the release dates, which is never the case in real-life applications (economic data points have different time stamps). Unemployment data may be two months behind the other indicators in terms of release date, which would then introduce a model bias in Shambho's example.

Next, this example is only descriptive statistics and not predictive (forecasting), because the data we want to forecast (unemploy) is not lagged correctly. It merely trains a model to best explain the variation in unemployment (which in this case is also a non-stationary time series, creating all sorts of issues in the modelling process) based on predictor variables at the same economic report dates.
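To make the lagging point concrete, here is a minimal base-R sketch of aligning predictors at time t with the target at time t + h. The toy values, the column name unemploy_ahead, and the horizon h = 2 are all made up for illustration; they are not taken from the economics data set.

```r
# Toy monthly data; values are made up purely for illustration.
df <- data.frame(month    = 1:8,
                 pce      = c(10, 11, 12, 13, 14, 15, 16, 17),
                 unemploy = c(50, 52, 51, 53, 55, 54, 56, 58))

h <- 2  # assumption: forecast two months ahead

# Target at time t + h, placed on the row holding the predictors known at time t
df$unemploy_ahead <- c(df$unemploy[-seq_len(h)], rep(NA, h))

# The last h rows have no future target yet, so drop them before training
lagged <- df[!is.na(df$unemploy_ahead), ]
lagged
```

A model would then be trained on a formula such as unemploy_ahead ~ pce, so that every prediction uses only information available h months before the value being forecast.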

Lastly, the 12-month horizon in this example is not true multi-period forecasting as Hyndman does it in his examples.

Hyndman on cross-validation for time-series

Answer 3

Actually, you can!

First, let me give you a scholarly article on the topic.

In R:

Using the caret package, createResample can be used to make simple bootstrap samples and createFolds can be used to generate balanced cross-validation groupings from a set of data. So you'll probably want to use createResample. Here's an example of its usage:

data(oil)
createDataPartition(oilType, 2)

x <- rgamma(50, 3, .5)
inA <- createDataPartition(x, list = FALSE)

plot(density(x[inA]))
rug(x[inA])

points(density(x[-inA]), type = "l", col = 4)
rug(x[-inA], col = 4)

createResample(oilType, 2)

createFolds(oilType, 10)
createFolds(oilType, 5, FALSE)

createFolds(rnorm(21))

createTimeSlices(1:9, 5, 1, fixedWindow = FALSE)
createTimeSlices(1:9, 5, 1, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = FALSE)

The values you see in the createResample call are the data and the number of partitions to create, in this case 2. You can additionally specify whether the results should be returned as a list, with list = TRUE or list = FALSE.

Additionally, caret contains a function called createTimeSlices that can create the indices for this type of splitting.

The three parameters for this type of splitting are:

initialWindow: the initial number of consecutive values in each training set sample
horizon: the number of consecutive values in each test set sample
fixedWindow: a logical; if FALSE, the training set always starts at the first sample and the training set size will vary over data splits.
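To see how these three parameters interact, the index construction can be sketched in base R. The helper rolling_slices below is hypothetical and written only for illustration; it is not caret's own code, though it is meant to produce the same indices as createTimeSlices with its default settings.

```r
# Hypothetical helper, written for illustration; not caret's own code.
rolling_slices <- function(n, initialWindow, horizon = 1, fixedWindow = TRUE) {
  # Last index of each training window
  stops <- seq(initialWindow, n - horizon)
  # Fixed window: the start slides forward; otherwise it stays at 1
  starts <- if (fixedWindow) stops - initialWindow + 1 else rep(1, length(stops))
  list(train = Map(seq, starts, stops),
       test  = Map(seq, stops + 1, stops + horizon))
}

s <- rolling_slices(9, initialWindow = 5, horizon = 3, fixedWindow = TRUE)
s$train  # 1:5 and 2:6
s$test   # 6:8 and 7:9

g <- rolling_slices(9, initialWindow = 5, horizon = 3, fixedWindow = FALSE)
g$train  # 1:5 and 1:6 (the training window grows)
```

The number of slices is n - initialWindow - horizon + 1, which is why 478 observations with initialWindow = 36 and horizon = 12 give 431 slices.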

Usage:

createDataPartition(y, 
                    times = 1,
                    p = 0.5,
                    list = TRUE,
                    groups = min(5, length(y)))
createResample(y, times = 10, list = TRUE)
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
createMultiFolds(y, k = 10, times = 5)
createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE)

Sources:

http://caret.r-forge.r-project.org/splitting.html

http://eranraviv.com/blog/bootstrapping-time-series-r-code/

http://rgm3.lab.nig.ac.jp/RGM/R_rdfile?f=caret/man/createDataPartition.Rd&d=R_CC

CARET. Relationship between data splitting and trainControl