简体   繁体   中英

Adding multiple lag variables using dplyr and for loops

I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models.

The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C):

# Load dplyr
library(dplyr)

# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
set.seed = 19
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))

# create data frame
data = data.frame(days, cats, values1, values2)

# mutate new lag variables 
LagVal = data %>% arrange(days) %>% group_by(cats) %>% 
  mutate(LagVal1.1 = lag(values1, 1)) %>%
  mutate(LagVal1.2 = lag(values1, 2)) %>%
  mutate(LagVal2.1 = lag(values2, 1)) %>%
  mutate(LagVal2.2 = lag(values2, 2))

LagVal

       days   cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
  <int> <fctr>   <dbl>   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1     1      A      16     -10        NA        NA        NA        NA
2     2      B      14      24        NA        NA        NA        NA
3     3      C      16      -6        NA        NA        NA        NA
4     4      A      12      25        16        NA       -10        NA
5     5      B      20      14        14        NA        24        NA
6     6      C      18      -5        16        NA        -6        NA
7     7      A      21       2        12        16        25       -10
8     8      B      19       5        20        14        14        24
9     9      C      18      -3        18        16        -5        -6

My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables.

Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable):

MultiMutate = function(data, variables, lags){
  # select the data to be working with
  FuncData = data
  # Loop through desired variables to mutate
  for (i in variables){
    # Loop through number of desired lags
    for (u in 1:lags){
      FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
        # Mutate new variable for desired number of lags. Give new variable a name with the lag number appended
        mutate(paste(i, u) = lag(i, u))
    }
  }
  FuncData
}

To be honest I'm just sort of lost on how to get this to work. The ordering of my for-loops and overall logic makes sense, but the way the function takes characters into variables and the overall syntax seems way off. Is there a simple way to fix up this function to get my desired result?

In particular, I'm looking for:

  1. A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above.

  2. Dynamically naming the variables based on the variable and their lag. Ie value1.1, value1.2, value2.1, value2.2, etc.

Thank you in advance and let me know if you need additional information. If there's a simpler way to get what I'm looking for, then I am all ears.

You'll have to reach deeper into the tidyverse toolbox to add them all at once. If you nest data for each value of cats , you can iterate over the nested data frames, iterating the lags over the values* columns in each.

library(tidyverse)
set.seed(47)

df <- data_frame(days = 1:9,
                 cats = rep(c('A','B','C'),3),
                 values1 = round(rnorm(9, 16, 4)),
                 values2 = round(rnorm(9, 16, 16)))


df %>% nest(-cats) %>% 
    mutate(lags = map(data, function(dat) {
        imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x), 
                                     paste0(.y, '_lag', 1:2)))
        })) %>% 
    unnest() %>% 
    arrange(days)
#> # A tibble: 9 x 8
#>   cats   days values1 values2 values1_lag1 values1_lag2 values2_lag1
#>   <chr> <int>   <dbl>   <dbl>        <dbl>        <dbl>        <dbl>
#> 1 A         1     24.     -7.          NA           NA           NA 
#> 2 B         2     19.      1.          NA           NA           NA 
#> 3 C         3     17.     17.          NA           NA           NA 
#> 4 A         4     15.     24.          24.          NA           -7.
#> 5 B         5     16.    -13.          19.          NA            1.
#> 6 C         6     12.     17.          17.          NA           17.
#> 7 A         7     12.     27.          15.          24.          24.
#> 8 B         8     16.     15.          16.          19.         -13.
#> 9 C         9     15.     36.          12.          17.          17.
#> # ... with 1 more variable: values2_lag2 <dbl>

data.table::shift makes this simpler, as it's vectorized. Naming takes more work than the actual lagging:

library(data.table)

setDT(df)

df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2), 
   by = cats, .SDcols = values1:values2][]
#>    days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1:    1    A      24      -7           NA           NA           NA
#> 2:    2    B      19       1           NA           NA           NA
#> 3:    3    C      17      17           NA           NA           NA
#> 4:    4    A      15      24           24           NA           -7
#> 5:    5    B      16     -13           19           NA            1
#> 6:    6    C      12      17           17           NA           17
#> 7:    7    A      12      27           15           24           24
#> 8:    8    B      16      15           16           19          -13
#> 9:    9    C      15      36           12           17           17
#>    values2_lag2
#> 1:           NA
#> 2:           NA
#> 3:           NA
#> 4:           NA
#> 5:           NA
#> 6:           NA
#> 7:           -7
#> 8:            1
#> 9:           17

In these cases, I rely on the magic of dplyr and tidyr :

library(dplyr)
library(tidyr)

set.seed(47)

# create data
s_data = data_frame(
  days = 1:9,
  cats = rep(c('A', 'B', 'C'), 3),
  values1 = round(rnorm(9, 16, 4)),
  values2 = round(rnorm(9, 16, 16))
)

max_lag = 2 # define max number of lags

# create lags
s_data %>% 
  gather(select = -c("days", "cats")) %>% # gather all variables that will be lagged
  mutate(n_lag = list(0:max_lag)) %>% # add list-column with lag numbers
  unnest() %>% # unnest the list column
  arrange(cats, key, n_lag, days) %>% # order the data.frame
  group_by(cats, key, n_lag) %>% # group by relevant variables
  # create lag. when grouped by vars above, n_lag is a constant vector, take 1st value
  mutate(lag_val = lag(value, n_lag[1])) %>% 
  ungroup() %>% 
  # create some fancy labels 
  mutate(var_name = ifelse(n_lag == 0, key, paste0("Lag", key, ".", n_lag))) %>% 
  select(-c(key, value, n_lag)) %>% # drop unnecesary data
  spread(var_name, lag_val) %>% # spread your newly created variables
  select(days, cats, starts_with("val"), starts_with("Lag")) # reorder

## # A tibble: 9 x 8
##    days cats  values1 values2 Lagvalues1.1 Lagvalues1.2 Lagvalues2.1 Lagvalues2.2
##   <int> <chr>   <dbl>   <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
## 1     1 A         24.     -7.          NA           NA           NA           NA 
## 2     2 B         19.      1.          NA           NA           NA           NA 
## 3     3 C         17.     17.          NA           NA           NA           NA 
## 4     4 A         15.     24.          24.          NA           -7.          NA 
## 5     5 B         16.    -13.          19.          NA            1.          NA 
## 6     6 C         12.     17.          17.          NA           17.          NA 
## 7     7 A         12.     27.          15.          24.          24.          -7.
## 8     8 B         16.     15.          16.          19.         -13.           1.
## 9     9 C         15.     36.          12.          17.          17.          17.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM