简体   繁体   English

使用dplyr和for循环添加多个滞后变量

[英]Adding multiple lag variables using dplyr and for loops

I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. 我有预测的时间序列数据,因此我正在创建滞后变量以用于统计分析。 I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models. 我想要一种快速的方法来创建给定特定输入的多个变量,以便我可以轻松地交叉验证和比较模型。

The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C): 以下是示例代码,这些代码在给定类别(A,B,C)的情况下为2个不同的变量(总共4个)加了2个滞后:

# Load dplyr
library(dplyr)

# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
set.seed = 19
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))

# create data frame
data = data.frame(days, cats, values1, values2)

# mutate new lag variables 
LagVal = data %>% arrange(days) %>% group_by(cats) %>% 
  mutate(LagVal1.1 = lag(values1, 1)) %>%
  mutate(LagVal1.2 = lag(values1, 2)) %>%
  mutate(LagVal2.1 = lag(values2, 1)) %>%
  mutate(LagVal2.2 = lag(values2, 2))

LagVal

       days   cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
  <int> <fctr>   <dbl>   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1      A      16     -10        NA        NA        NA        NA
2     2      B      14      24        NA        NA        NA        NA
3     3      C      16      -6        NA        NA        NA        NA
4     4      A      12      25        16        NA       -10        NA
5     5      B      20      14        14        NA        24        NA
6     6      C      18      -5        16        NA        -6        NA
7     7      A      21       2        12        16        25       -10
8     8      B      19       5        20        14        14        24
9     9      C      18      -3        18        16        -5        -6

My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables. 我的问题出在# mutate new lag variables步骤,因为我有大约十二个预测变量,我可能希望滞后多达10次(〜13k行数据集),而且我无心创建120新变量。

Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable): 这是我尝试编写的函数,该函数根据给定的data输入(要突变的data集), variables (您希望滞后的变量)和lags (每个变量的滞后数)来使新变量发生变化:

MultiMutate = function(data, variables, lags){
  # select the data to be working with
  FuncData = data
  # Loop through desired variables to mutate
  for (i in variables){
    # Loop through number of desired lags
    for (u in 1:lags){
      FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
        # Mutate new variable for desired number of lags. Give new variable a name with the lag number appended
        mutate(paste(i, u) = lag(i, u))
    }
  }
  FuncData
}

To be honest I'm just sort of lost on how to get this to work. 老实说,我只是迷失在如何使它起作用上。 The ordering of my for-loops and overall logic makes sense, but the way the function takes characters into variables and the overall syntax seems way off. 我的for循环和整体逻辑的顺序很有意义,但是该函数将字符带入变量的方式以及整体语法似乎还遥遥无期。 Is there a simple way to fix up this function to get my desired result? 有没有简单的方法可以修复此功能以获得我想要的结果?

In particular, I'm looking for: 我特别在寻找:

  1. A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above. MultiMutate(data = data, variables = c(values1, values2), lags = 2)之类的LagVal将从上面创建LagVal的确切结果。

  2. Dynamically naming the variables based on the variable and their lag. 根据变量及其滞后动态命名变量。 Ie value1.1, value1.2, value2.1, value2.2, etc. 即value1.1,value1.2,value2.1,value2.2等。

Thank you in advance and let me know if you need additional information. 预先谢谢您,如果需要其他信息,请告诉我。 If there's a simpler way to get what I'm looking for, then I am all ears. 如果有一种更简单的方法来获取我想要的东西,那么我无所不能。

You'll have to reach deeper into the tidyverse toolbox to add them all at once. 您必须更深入tidyverse工具箱才能一次添加它们。 If you nest data for each value of cats , you can iterate over the nested data frames, iterating the lags over the values* columns in each. 如果您为cats每个值嵌套数据,则可以迭代嵌套的数据帧,迭代每个catsvalues*列的滞后。

library(tidyverse)
set.seed(47)

df <- data_frame(days = 1:9,
                 cats = rep(c('A','B','C'),3),
                 values1 = round(rnorm(9, 16, 4)),
                 values2 = round(rnorm(9, 16, 16)))


df %>% nest(-cats) %>% 
    mutate(lags = map(data, function(dat) {
        imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x), 
                                     paste0(.y, '_lag', 1:2)))
        })) %>% 
    unnest() %>% 
    arrange(days)
#> # A tibble: 9 x 8
#>   cats   days values1 values2 values1_lag1 values1_lag2 values2_lag1
#>   <chr> <int>   <dbl>   <dbl>        <dbl>        <dbl>        <dbl>
#> 1 A         1     24.     -7.          NA           NA           NA 
#> 2 B         2     19.      1.          NA           NA           NA 
#> 3 C         3     17.     17.          NA           NA           NA 
#> 4 A         4     15.     24.          24.          NA           -7.
#> 5 B         5     16.    -13.          19.          NA            1.
#> 6 C         6     12.     17.          17.          NA           17.
#> 7 A         7     12.     27.          15.          24.          24.
#> 8 B         8     16.     15.          16.          19.         -13.
#> 9 C         9     15.     36.          12.          17.          17.
#> # ... with 1 more variable: values2_lag2 <dbl>

data.table::shift makes this simpler, as it's vectorized. data.table::shift使它变得更简单,因为它data.table::shift量化。 Naming takes more work than the actual lagging: 命名比实际滞后要花更多的工作:

library(data.table)

setDT(df)

df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2), 
   by = cats, .SDcols = values1:values2][]
#>    days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1:    1    A      24      -7           NA           NA           NA
#> 2:    2    B      19       1           NA           NA           NA
#> 3:    3    C      17      17           NA           NA           NA
#> 4:    4    A      15      24           24           NA           -7
#> 5:    5    B      16     -13           19           NA            1
#> 6:    6    C      12      17           17           NA           17
#> 7:    7    A      12      27           15           24           24
#> 8:    8    B      16      15           16           19          -13
#> 9:    9    C      15      36           12           17           17
#>    values2_lag2
#> 1:           NA
#> 2:           NA
#> 3:           NA
#> 4:           NA
#> 5:           NA
#> 6:           NA
#> 7:           -7
#> 8:            1
#> 9:           17

In these cases, I rely on the magic of dplyr and tidyr : 在这些情况下,我依靠dplyrtidyr的魔力:

library(dplyr)
library(tidyr)

set.seed(47)

# create data
s_data = data_frame(
  days = 1:9,
  cats = rep(c('A', 'B', 'C'), 3),
  values1 = round(rnorm(9, 16, 4)),
  values2 = round(rnorm(9, 16, 16))
)

max_lag = 2 # define max number of lags

# create lags
s_data %>% 
  gather(select = -c("days", "cats")) %>% # gather all variables that will be lagged
  mutate(n_lag = list(0:max_lag)) %>% # add list-column with lag numbers
  unnest() %>% # unnest the list column
  arrange(cats, key, n_lag, days) %>% # order the data.frame
  group_by(cats, key, n_lag) %>% # group by relevant variables
  # create lag. when grouped by vars above, n_lag is a constant vector, take 1st value
  mutate(lag_val = lag(value, n_lag[1])) %>% 
  ungroup() %>% 
  # create some fancy labels 
  mutate(var_name = ifelse(n_lag == 0, key, paste0("Lag", key, ".", n_lag))) %>% 
  select(-c(key, value, n_lag)) %>% # drop unnecesary data
  spread(var_name, lag_val) %>% # spread your newly created variables
  select(days, cats, starts_with("val"), starts_with("Lag")) # reorder

## # A tibble: 9 x 8
##    days cats  values1 values2 Lagvalues1.1 Lagvalues1.2 Lagvalues2.1 Lagvalues2.2
##   <int> <chr>   <dbl>   <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
## 1     1 A         24.     -7.          NA           NA           NA           NA 
## 2     2 B         19.      1.          NA           NA           NA           NA 
## 3     3 C         17.     17.          NA           NA           NA           NA 
## 4     4 A         15.     24.          24.          NA           -7.          NA 
## 5     5 B         16.    -13.          19.          NA            1.          NA 
## 6     6 C         12.     17.          17.          NA           17.          NA 
## 7     7 A         12.     27.          15.          24.          24.          -7.
## 8     8 B         16.     15.          16.          19.         -13.           1.
## 9     9 C         15.     36.          12.          17.          17.          17.

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM