Creating time series cross-validation slices by key in the tidyverts package

Question

Is there a way to create time series cross-validation sets by key using the tidyverts package? I can't seem to get it right. Below is a reprex of my attempt.

The example involves creating time series cross-validation slices (1 step ahead) for forecasting. The key variable has 2 distinct values, and I would like to have one tsibble containing the time series slices for both keys. When I try to row-bind both tsibbles, I get an error.

library(dplyr)
library(tibble)
library(tsibble)

# helper function
create_cv_slices <- function(data, forecast_horizon) {
  data %>%
    dplyr::slice(1:(nrow(data) - forecast_horizon)) %>%
    tsibble::stretch_tsibble(.init = nrow(data) - 2 * forecast_horizon, .step = 1)
}

# get data
raw_tsbl <- tibble::tribble(
  ~index,      ~key,    ~Revenue,     ~Claims,
  20160101, "series1",  11011836.1, 5386836.696,
  20160201, "series1", 11042641.16, 9967325.715,
  20160301, "series1", 11445687.52, 10947197.89,
  20160401, "series1", 11252943.11, 6980431.415,
  20160101, "series2",    12236155,    12526224,
  20160201, "series2",     8675364,     9812904,
  20160301, "series2",    10081130,     8423497,
  20160401, "series2",    14840111,     8079813
) %>%
  dplyr::mutate(index = tsibble::yearmonth(as.character(index))) %>%
  tsibble::as_tsibble(index = index, key = key)

keys <- unique(raw_tsbl$key)

# split & combine
tbl1 = raw_tsbl %>%
  dplyr::filter(key == keys[1]) %>%
  create_cv_slices(., forecast_horizon = 1) %>%
  tibble::as_tibble()

tbl2 = raw_tsbl %>%
  dplyr::filter(key == keys[2]) %>%
  create_cv_slices(., forecast_horizon = 1) %>%
  tibble::as_tibble()

dplyr::bind_rows(tbl1, tbl2) %>%
  tsibble::as_tsibble(index = index, key = key)
#> Error: A valid tsibble must have distinct rows identified by key and index.
#> Please use `duplicates()` to check the duplicated rows.

Thank you.

Answer 1

It appears that combining the tsibbles with bind_rows is what doesn't work. Using bind_rows and setting validate = FALSE in as_tsibble() does create a tsibble, but it is displayed as a daily series instead of a monthly one (which is what it should be). Using rbind with the same argument setting, however, creates the desired tsibble.

rbind(tbl1, tbl2) %>%
  tsibble::as_tsibble(index = index, key = c(key, .id), validate = FALSE)
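
The error message in the question suggests duplicates() for inspecting the offending rows. The sketch below uses made-up toy data (not the question's figures) and hypothetical object names to show what that check reports: after stretch_tsibble(), the same key and index pair appears in several slices, so rows are only unique once .id is part of the key, as in the call above.

```r
library(dplyr)
library(tsibble)

# Made-up toy data (an assumption for illustration): two keys, four months each.
month_starts <- seq(as.Date("2016-01-01"), by = "1 month", length.out = 4)
toy_tsbl <- tibble::tibble(
  index = yearmonth(rep(month_starts, times = 2)),
  key = rep(c("series1", "series2"), each = 4),
  Revenue = c(11, 12, 13, 14, 21, 22, 23, 24)
) %>%
  as_tsibble(index = index, key = key)

# Stretching adds an `.id` column; dropping the tsibble class mimics the
# plain tibbles that get row-bound in the question.
slices_tbl <- toy_tsbl %>%
  stretch_tsibble(.init = 2, .step = 1) %>%
  as_tibble()

# Rows that clash when only `key` identifies a series.
dups_key_only <- duplicates(slices_tbl, key = key, index = index)

# Rows that clash when `.id` is part of the key as well.
dups_with_id <- duplicates(slices_tbl, key = c(key, .id), index = index)

nrow(dups_key_only)
nrow(dups_with_id)
```

If the second count is zero, the combined table can be declared a tsibble with key = c(key, .id).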

Thanks.

Answer 2

Rather than splitting the data manually by key, you can compute your slices on groups of the tsibble. group_by_key() is a convenience function (with better performance) that is equivalent to group_by(key). The n() function is a group-aware dplyr function that gives the number of observations in the current group.

library(dplyr)
library(tibble)
library(tsibble)

# get data
raw_tsbl <- tibble::tribble(
  ~index,      ~key,    ~Revenue,     ~Claims,
  20160101, "series1",  11011836.1, 5386836.696,
  20160201, "series1", 11042641.16, 9967325.715,
  20160301, "series1", 11445687.52, 10947197.89,
  20160401, "series1", 11252943.11, 6980431.415,
  20160101, "series2",    12236155,    12526224,
  20160201, "series2",     8675364,     9812904,
  20160301, "series2",    10081130,     8423497,
  20160401, "series2",    14840111,     8079813
) %>%
  dplyr::mutate(index = tsibble::yearmonth(as.character(index))) %>%
  tsibble::as_tsibble(index = index, key = key)

forecast_horizon <- 1

raw_tsbl %>% 
  group_by_key() %>% 
  slice(1:(n() - forecast_horizon)) %>% 
  ungroup() %>% 
  stretch_tsibble(.init = 2, .step = 1)
#> # A tsibble: 10 x 5 [1M]
#> # Key:       .id, key [4]
#>       index key       Revenue    Claims   .id
#>       <mth> <chr>       <dbl>     <dbl> <int>
#>  1 2016 Jan series1 11011836.  5386837.     1
#>  2 2016 Feb series1 11042641.  9967326.     1
#>  3 2016 Jan series2 12236155  12526224      1
#>  4 2016 Feb series2  8675364   9812904      1
#>  5 2016 Jan series1 11011836.  5386837.     2
#>  6 2016 Feb series1 11042641.  9967326.     2
#>  7 2016 Mar series1 11445688. 10947198.     2
#>  8 2016 Jan series2 12236155  12526224      2
#>  9 2016 Feb series2  8675364   9812904      2
#> 10 2016 Mar series2 10081130   8423497      2

Created on 2020-05-08 by the reprex package (v0.3.0)

A slight difference in this code is that .init is set to 2, rather than nrow(data) - 2 * forecast_horizon. For this data it gives the same result; however, if the number of observations differs between keys, it won't. Once dplyr v1.0.0 is released, it will be easier to use tools like group_map() or bind_rows() for the split-apply-combine approach needed to specify different window parameters for each key.
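
As a rough sketch of that split-apply-combine idea, the code below computes .init separately for each key. It is only an illustration under stated assumptions: the toy data is made up (series1 has five months, series2 has four), the object names are hypothetical, it uses base split() and lapply() rather than group_map(), and it assumes a tsibble version in which yearmonth columns survive bind_rows().

```r
library(dplyr)
library(tsibble)

# Made-up toy data (an assumption): series1 has 5 months, series2 has 4.
dates <- c(
  seq(as.Date("2016-01-01"), by = "1 month", length.out = 5),
  seq(as.Date("2016-01-01"), by = "1 month", length.out = 4)
)
toy_tsbl <- tibble::tibble(
  index = yearmonth(dates),
  key = c(rep("series1", 5), rep("series2", 4)),
  Revenue = c(11, 12, 13, 14, 15, 21, 22, 23, 24)
) %>%
  as_tsibble(index = index, key = key)

forecast_horizon <- 1

# Split: one plain tibble per key.
per_key <- split(as_tibble(toy_tsbl), toy_tsbl$key)

# Apply: stretch each series with an .init based on its own length,
# mirroring the helper function in the question.
per_key_slices <- lapply(per_key, function(d) {
  d <- as_tsibble(d, index = index, key = key)
  d %>%
    slice(seq_len(nrow(d) - forecast_horizon)) %>%
    stretch_tsibble(.init = nrow(d) - 2 * forecast_horizon, .step = 1) %>%
    as_tibble()
})

# Combine: rows are unique per key, slice id and index.
cv_slices <- bind_rows(per_key_slices) %>%
  as_tsibble(index = index, key = c(key, .id))

cv_slices
```

Including .id in the key when rebuilding the tsibble is what keeps the repeated observations across slices distinct.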

Answer 1
0 2020-05-06 02:03:43

Answer 2
0 2020-05-08 07:27:33
