R - How do I lag/lead multiple columns in a data.table by multiple periods most efficiently?
I have a large data.table that stores one (monthly) date column and a number of variables of interest measured at the respective dates for various subjects/IDs. Now I want to add, for a subset of those variables (only some columns), newly computed columns that lead AND lag those columns by multiple periods, all at once. Is that doable? See below for example data that represents the high-level structure of my table, and for what I have tried so far.
Date ID Var_A Var_B Var_C
2000-01-31 1 100 500 1000
2000-02-28 1 200 600 1100
2000-03-31 1 300 700 1200
2000-04-30 1 400 800 1300
2000-01-31 2 100 500 1000
2000-02-28 2 200 600 1100
2000-03-31 2 300 700 1200
2000-04-30 2 400 800 1300
dt[, `:=`(Var_A_Lag_1  = shift(Var_A, 1),
          Var_A_Lead_1 = shift(Var_A, 1, type = 'lead'),
          Var_A_Lag_2  = shift(Var_A, 2),
          Var_A_Lead_2 = shift(Var_A, 2, type = 'lead'),
          Var_B_Lag_1  = shift(Var_B, 1),
          Var_B_Lead_1 = shift(Var_B, 1, type = 'lead'),
          Var_B_Lag_2  = shift(Var_B, 2),
          Var_B_Lead_2 = shift(Var_B, 2, type = 'lead')),
   by = ID]
But that cannot be efficient, can it? I tried something that I thought was very intuitive and would work, but no luck.
cols_to_edit <- which(sapply(dt, is.numeric))
cols_to_edit <- colnames(dt[, cols_to_edit, with = FALSE])
# col names of shifted variables
col_names_lag_1 <- paste(cols_to_edit, "lag_1", sep = "_")
col_names_lag_2 <- paste(cols_to_edit, "lag_2", sep = "_")
col_names_lead_1 <- paste(cols_to_edit, "lead_1", sep = "_")
col_names_lead_2 <- paste(cols_to_edit, "lead_2", sep = "_")
# colnames for differences
col_names_lag_1_d <- paste("d", cols_to_edit, "lag_1", sep = "_")
col_names_lag_2_d <- paste("d", cols_to_edit, "lag_2", sep = "_")
col_names_lead_1_d <- paste("d", cols_to_edit, "lead_1", sep = "_")
col_names_lead_2_d <- paste("d", cols_to_edit, "lead_2", sep = "_")
# Execute the shift command
dt_2[, (col_names_lag_1) := shift(cols_to_edit, 1), by = ID]
# would have repeated for all new columns as defined above but it is not working.
I basically want all numeric variables in this table shifted by 1 and 2 periods in either direction. The newly computed values should then be assigned to the columns named by the name vectors declared above. I didn't find any other question similar to my case. Do you have any idea, or know a best practice for doing this?
The context: the variables are selected metrics used as input for a regression model that requires its input in this format.
How about this, in a nice little for loop:
cols <- grep("Var", names(dt), value = TRUE)
for (i in 1:2) {  # update for the number of shifts
  lag_names <- paste(cols, "Lag", i, sep = "_")
  # by = ID keeps each subject's shifts within its own rows
  dt[, (lag_names) := lapply(.SD, shift, i, type = "lag"), by = ID, .SDcols = cols]
  lead_names <- paste(cols, "Lead", i, sep = "_")
  dt[, (lead_names) := lapply(.SD, shift, i, type = "lead"), by = ID, .SDcols = cols]
}
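If you would rather avoid the loop, shift() also accepts a vector for n and a list (such as .SD) for x, so all periods can probably be handled in one call per direction. Below is a minimal sketch under a few assumptions: the example table is rebuilt by hand from the data in the question, the rows are sorted by ID and Date, and the helper names periods, lag_names and lead_names are mine. The name vectors are built with rep(cols, each = ...) because shift() returns its results column by column, with all periods of one column before the next column.

```r
library(data.table)

# Example data rebuilt from the table in the question (an assumption about its types)
dt <- data.table(
  Date  = as.Date(rep(c("2000-01-31", "2000-02-28", "2000-03-31", "2000-04-30"), 2)),
  ID    = rep(1:2, each = 4),
  Var_A = rep(c(100, 200, 300, 400), 2),
  Var_B = rep(c(500, 600, 700, 800), 2),
  Var_C = rep(c(1000, 1100, 1200, 1300), 2)
)

cols    <- grep("^Var", names(dt), value = TRUE)  # columns to shift
periods <- 1:2                                    # number of periods to shift by

# One name per (column, period) pair, in the order shift() returns them:
# Var_A_Lag_1, Var_A_Lag_2, Var_B_Lag_1, ...
lag_names  <- paste(rep(cols, each = length(periods)), "Lag",  periods, sep = "_")
lead_names <- paste(rep(cols, each = length(periods)), "Lead", periods, sep = "_")

# Shifting is positional, so make sure rows are ordered within each ID first
setorder(dt, ID, Date)

dt[, (lag_names)  := shift(.SD, n = periods, type = "lag"),  by = ID, .SDcols = cols]
dt[, (lead_names) := shift(.SD, n = periods, type = "lead"), by = ID, .SDcols = cols]
```

To restrict this to a subset of the variables, set cols to just those column names; the rest of the sketch should not need to change.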