
R - How do I lag/lead multiple columns in a data.table by multiple periods most efficiently

I have a large data.table that stores one date column (monthly) and a number of different variables of interest, measured at the respective dates for various subjects/IDs. For a subset of those variables (only some columns), I want to add newly computed columns that lead AND lag those columns by multiple periods, all at once. Is that doable? Below is some example data that represents the high-level structure of my table, followed by what I have tried so far.

Date        ID   Var_A   Var_B   Var_C
2000-01-31  1    100     500     1000
2000-02-28  1    200     600     1100
2000-03-31  1    300     700     1200
2000-04-30  1    400     800     1300 
2000-01-31  2    100     500     1000
2000-02-28  2    200     600     1100
2000-03-31  2    300     700     1200
2000-04-30  2    400     800     1300
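
A minimal sketch for rebuilding this example table as a data.table, so that the snippets that follow can be run as-is. The object name dt matches the code below, and the values are copied from the table above; the column types (Date for the date column, integer for ID, numeric for the variables) are an assumption.

```r
library(data.table)

# Example data copied from the table above: 2 IDs x 4 monthly dates
dates <- as.Date(c("2000-01-31", "2000-02-28", "2000-03-31", "2000-04-30"))
dt <- data.table(
  Date  = rep(dates, times = 2),
  ID    = rep(1:2, each = 4),
  Var_A = rep(c(100, 200, 300, 400), times = 2),
  Var_B = rep(c(500, 600, 700, 800), times = 2),
  Var_C = rep(c(1000, 1100, 1200, 1300), times = 2)
)
```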

dt[, `:=`(Var_A_Lag_1 = shift(Var_A, 1),
          Var_A_Lead_1 = shift(Var_A, 1, type = "lead"),
          Var_A_Lag_2 = shift(Var_A, 2),
          Var_A_Lead_2 = shift(Var_A, 2, type = "lead"),
          Var_B_Lag_1 = shift(Var_B, 1),
          Var_B_Lead_1 = shift(Var_B, 1, type = "lead"),
          Var_B_Lag_2 = shift(Var_B, 2),
          Var_B_Lead_2 = shift(Var_B, 2, type = "lead")),
   by = ID]

But that cannot be efficient, can it? I tried something that I thought was very intuitive and would work, but no luck.

cols_to_edit <- which(sapply(dt, is.numeric))
cols_to_edit <- colnames(dt[, cols_to_edit, with = FALSE])

# column names of shifted variables
col_names_lag_1 <- paste(cols_to_edit, "lag_1", sep = "_")
col_names_lag_2 <- paste(cols_to_edit, "lag_2", sep = "_")
col_names_lead_1 <- paste(cols_to_edit, "lead_1", sep = "_")
col_names_lead_2 <- paste(cols_to_edit, "lead_2", sep = "_")

# colnames for differences 
col_names_lag_1_d <- paste("d", cols_to_edit, "lag_1", sep = "_")
col_names_lag_2_d <- paste("d", cols_to_edit, "lag_2", sep = "_")
col_names_lead_1_d <- paste("d", cols_to_edit, "lead_1", sep = "_")
col_names_lead_2_d <- paste("d", cols_to_edit, "lead_2", sep = "_")

# Execute the shift command
dt_2[, (col_names_lag_1) := shift(cols_to_edit, 1), by = ID] 
# would have repeated for all new columns as defined above but it is not working. 

I basically want all numeric variables in this table shifted by, say, 1 and 2 periods in either direction. The newly computed values should then be assigned to the columns named by the name vectors declared above. I didn't find any other question similar to my case here. Do you have any idea, or know a best practice, for doing this?

The context: the variables are selected metrics used as input for a regression model that requires the input to be available in that format.

How about this, in a nice little for loop:

cols <- grep("Var", names(dt), value = TRUE)  # columns to shift
for (i in 1:2) {  # update for the number of shifts

  # lag each selected column by i periods within each ID
  lag_names <- paste(cols, "Lag", i, sep = "_")
  dt[, (lag_names) := lapply(.SD, shift, i, type = "lag"), by = ID, .SDcols = cols]

  # lead each selected column by i periods within each ID
  lead_names <- paste(cols, "Lead", i, sep = "_")
  dt[, (lead_names) := lapply(.SD, shift, i, type = "lead"), by = ID, .SDcols = cols]

}
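
As a possible alternative to the loop, shift() also accepts a list of columns as its first argument and a vector of offsets in n, so all lags (or all leads) can be created in one grouped call. The sketch below rebuilds the example table from the question so that it is self-contained, and groups by ID so that values do not leak from one ID into the next. The name vectors assume that shift() returns its result column by column, with the offsets varying fastest within each column; the variable name shifts is only illustrative.

```r
library(data.table)

# Example data copied from the question
dates <- as.Date(c("2000-01-31", "2000-02-28", "2000-03-31", "2000-04-30"))
dt <- data.table(
  Date  = rep(dates, times = 2),
  ID    = rep(1:2, each = 4),
  Var_A = rep(c(100, 200, 300, 400), times = 2),
  Var_B = rep(c(500, 600, 700, 800), times = 2),
  Var_C = rep(c(1000, 1100, 1200, 1300), times = 2)
)

cols   <- grep("^Var", names(dt), value = TRUE)  # columns to shift
shifts <- 1:2                                    # offsets to apply

# One name per (column, offset) pair, in the order shift() returns them:
# Var_A_Lag_1, Var_A_Lag_2, Var_B_Lag_1, ...
lag_names  <- paste(rep(cols, each = length(shifts)), "Lag",  shifts, sep = "_")
lead_names <- paste(rep(cols, each = length(shifts)), "Lead", shifts, sep = "_")

# A single grouped call per direction creates all shifted columns at once
dt[, (lag_names)  := shift(.SD, n = shifts, type = "lag"),  by = ID, .SDcols = cols]
dt[, (lead_names) := shift(.SD, n = shifts, type = "lead"), by = ID, .SDcols = cols]
```

With the example data, Var_A_Lag_1 for ID 1 is NA, 100, 200, 300, and the first row of ID 2 is again NA rather than 400, because the shift is applied within each ID.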
