简体   繁体   中英

splitting data for time series prediction

I am looking for an R package, which allows me to do n-fold CV type hyper parameter optimisation (eg n = 10). Let us say this is the data I can used to tweak the hyper parameters (I tend to use rBayesianOptimization so let us abstract this away):

dates <- seq(as.Date('2017-01-01'), as.Date('2019-12-31'), by = 'days')

df <- data.frame(date = dates)
df$y <- 42

Here y, the dependent variable, is an obviously well known constant and it is just added here without being exploited.

I came across the caret function createTimeSlices and this would be a possible approach to split the data:

slices <- createTimeSlices(df$date, initialWindow = 365 * 2.5, horizon = 30, fixedWindow = TRUE)

I end up with a list like this:

List of 2
 $ train:List of 153
  ..$ Training0912.5: int [1:912] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ Training0913.5: int [1:912] 2 3 4 5 6 7 8 9 10 11 ...
...   
  ..$ Training1010.5: int [1:912] 99 100 101 102 103 104 105 106 107 108 ...
  .. [list output truncated]
 $ test :List of 153
  ..$ Testing0912.5: num [1:30] 914 914 916 916 918 ...
  ..$ Testing0913.5: num [1:30] 914 916 916 918 918 ...

Can someone please point pout how to use this or refer me to another package? Personally, I am a bit confused about the training data indices only to shift 1 day (?). I would have thought it shifts 30 days (see horizon).

Thanks.

I found a way to use createTimeSlices inspired by Shambho's SO answer .

library(caret)

dates <- seq(as.Date('2017-01-01'), as.Date('2019-12-31'), by = 'days')

df <- data.frame(date = dates)
df$x <- 1
df$y <- 42

timeSlices <- createTimeSlices(1:nrow(df), initialWindow = 365 * 2, horizon = 30, fixedWindow = TRUE, skip = 30)

#str(timeSlices, max.level = 1)

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]

for (i in 1:length(trainSlices)) {

    train <- df[trainSlices[[i]],]
    test <- df[testSlices[[i]],]

    # fit and calculate performance on test to ultimately get average etc.

    print(paste0(min(train$date), " - ", max(train$date)))
    print(paste0(min(test$date), " - ", max(test$date)))
    print("")
}

The key for me was to specify skip as otherwise the window would only move 1 day and one would end up with to many "folds".

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM