
R data.table: creating a custom function using lapply to create and reassign multiple variables

I have the following lines of code:

DT[flag==T, temp:=haz_1.5]
DT[, temp:= na.locf(temp, na.rm = FALSE), "pid"]
DT[agedays==61, haz_1.5_1:=temp]

I need to convert this into a function so that it works on a list of variables instead of just a single one. I have recently learned how to create a function using lapply by passing a list of columns and conditions to create one set of new columns. However, I'm unsure how to do this when I pass a list of columns and also need to carry values forward within each of these columns.

For instance, I can code the following:

  columns<-c("haz_1.5", "waz_1.5")
  new_cols <- paste(columns, "1", sep = "_")
  x=61
  maled_anthro[(flag==TRUE)&(agedays==x), (new_cols) := lapply(.SD, function(y) na.locf(y,    na.rm=F)), .SDcols = columns] 

But this is missing the grouped na.locf step, so I am not getting the same output as the original lines of code. How would I incorporate the line that uses na.locf to carry values forward (DT[, temp:= na.locf(temp, na.rm = FALSE), "pid"]) so that everything is wrapped up in a single function? Would this work with lapply in the same manner?

Dummy data that's similar to the data table I'm using:

DT <- data.table(pid  = c(1,1,2,3,3,4,4,5,5,5),
                 flag = c(T,T,F,T,T,F,T,T,T,T),
                 agedays = c(1,61,61,51,61,23,61,1,32,61),
                 haz_1.5 = c(1,1,1,2,NA,1,3,2,3,4),
                 waz_1.5 = c(1,NA,NA,NA,NA,2,2,3,4,4))

OP's code can be turned into an anonymous function which is applied to the selected columns:

library(data.table)
columns <- c("haz_1.5", "waz_1.5")
new_cols <- paste0(columns, "_1")
x <-  61

DT[, (new_cols) := lapply(.SD, function(v) {
  temp <- fifelse(flag, v, NA_real_)
  temp <- nafill(temp, "locf")
  fifelse(agedays == x, temp, NA_real_)
}), .SDcols = columns, by = pid][]
    pid  flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
 1:   1  TRUE       1       1       1        NA        NA
 2:   1  TRUE      61       1      NA         1         1
 3:   2 FALSE      61       1      NA        NA        NA
 4:   3  TRUE      51       2      NA        NA        NA
 5:   3  TRUE      61      NA      NA         2        NA
 6:   4 FALSE      23       1       2        NA        NA
 7:   4  TRUE      61       3       2         3         2
 8:   5  TRUE       1       2       3        NA        NA
 9:   5  TRUE      32       3       4        NA        NA
10:   5  TRUE      61       4       4         4         4
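If the logic needs to be reused, the anonymous function above can be wrapped in a named helper. The following is only a sketch and not part of the original answer: the name locf_at_age() and its arguments cols, age, suffix and group_col are made up for illustration. It assumes the selected columns are of type double and that the data.table contains columns named flag and agedays.

```r
library(data.table)

# Hypothetical helper (name and arguments are illustrative assumptions).
# For each column in `cols`: keep the values of flagged rows, carry them
# forward within each group, and store the result in a new column only
# on rows where agedays equals `age`. Modifies `dt` by reference.
locf_at_age <- function(dt, cols, age, suffix = "_1", group_col = "pid") {
  new_cols <- paste0(cols, suffix)
  dt[, (new_cols) := lapply(.SD, function(v) {
    temp <- fifelse(flag, v, NA_real_)        # values of flagged rows only
    temp <- nafill(temp, type = "locf")       # last observation carried forward
    fifelse(agedays == age, temp, NA_real_)   # keep rows at the target age
  }), .SDcols = cols, by = group_col]
  invisible(dt)
}

# Usage with the dummy data from the question
DT <- data.table(pid     = c(1, 1, 2, 3, 3, 4, 4, 5, 5, 5),
                 flag    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
                 agedays = c(1, 61, 61, 51, 61, 23, 61, 1, 32, 61),
                 haz_1.5 = c(1, 1, 1, 2, NA, 1, 3, 2, 3, 4),
                 waz_1.5 = c(1, NA, NA, NA, NA, 2, 2, 3, 4, 4))
locf_at_age(DT, c("haz_1.5", "waz_1.5"), age = 61)
```

Because := assigns by reference, the helper changes DT in place; pass copy(DT) instead if the original table must stay untouched.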

This is the same result we would get by manually repeating OP's code for the two columns. Note that the temp column must be removed before it is reused for the next column, because a subsetted assignment by reference only overwrites the selected rows and would leave stale values in the others.

DT[(flag), temp := haz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, haz_1.5_1 := temp]
DT[, temp := NULL]
DT[(flag), temp := waz_1.5]
DT[, temp := zoo::na.locf(temp, na.rm = FALSE), by = pid]
DT[agedays == 61, waz_1.5_1 := temp]
DT[, temp := NULL][]
    pid  flag agedays haz_1.5 waz_1.5 haz_1.5_1 waz_1.5_1
 1:   1  TRUE       1       1       1        NA        NA
 2:   1  TRUE      61       1      NA         1         1
 3:   2 FALSE      61       1      NA        NA        NA
 4:   3  TRUE      51       2      NA        NA        NA
 5:   3  TRUE      61      NA      NA         2        NA
 6:   4 FALSE      23       1       2        NA        NA
 7:   4  TRUE      61       3       2         3         2
 8:   5  TRUE       1       2       3        NA        NA
 9:   5  TRUE      32       3       4        NA        NA
10:   5  TRUE      61       4       4         4         4
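The dummy data happens not to expose what goes wrong when temp is reused without being cleared. The sketch below uses a made-up three-row table (toy, with columns a and b, none of which appear in the question) to show the effect. As an assumption for brevity, nafill() stands in for zoo::na.locf() here.

```r
library(data.table)

# Made-up toy data: one group whose middle row is not flagged
toy <- data.table(pid  = c(1, 1, 1),
                  flag = c(TRUE, FALSE, TRUE),
                  a    = c(10, 99, NA),
                  b    = c(NA_real_, 99, NA))

# Process the first column: afterwards temp has been filled for every row
toy[(flag), temp := a]
toy[, temp := nafill(temp, type = "locf"), by = pid]

# Second column WITHOUT clearing temp: the unflagged middle row still
# holds a value carried over from column a, and locf spreads it onward
stale <- copy(toy)
stale[(flag), temp := b]
stale[, temp := nafill(temp, type = "locf"), by = pid]

# Second column WITH clearing temp first: nothing leaks from column a
clean <- copy(toy)
clean[, temp := NULL]
clean[(flag), temp := b]
clean[, temp := nafill(temp, type = "locf"), by = pid]

stale$temp  # contains a value that originated in column a
clean$temp  # all NA, as column b has no flagged non-missing values
```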

Some explanations

  • There is one important difference between OP's "single column" code and this approach: the anonymous function is called once for each value of the grouping variable pid. In OP's code, the first and last assignments work on the ungrouped (full) vectors, which might be somewhat more efficient. However, those two assignments do not depend on pid, so the result is the same.
  • Instead of zoo::na.locf(), data.table's nafill() function is used (new with data.table v1.12.4, on CRAN 03 Oct 2019).
  • DT[(flag), ...] is equivalent to DT[flag == TRUE, ...].
  • When fifelse() is used instead of a subsetted assignment by reference, the no argument must be NA to give the same result, because a subsetted assignment to a new column leaves the unselected rows as NA. Thus, DT[, temp := fifelse(flag, haz_1.5, NA_real_)][] is equivalent to DT[(flag), temp := haz_1.5][].
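The equivalences listed above can be checked on a small example. This is a sketch on a shortened version of the dummy data; the object names small, a and b and the column filled are chosen here for illustration only.

```r
library(data.table)

small <- data.table(pid     = c(1, 1, 2, 3, 3),
                    flag    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
                    haz_1.5 = c(1, 1, 1, 2, NA))

# DT[(flag)] and DT[flag == TRUE] select the same rows
stopifnot(identical(small[(flag), which = TRUE],
                    small[flag == TRUE, which = TRUE]))

# Subsetted assignment by reference vs. fifelse() with no = NA
a <- copy(small)[(flag), temp := haz_1.5]
b <- copy(small)[, temp := fifelse(flag, haz_1.5, NA_real_)]
stopifnot(identical(a$temp, b$temp))

# nafill() with type = "locf", applied within each pid
b[, filled := nafill(temp, type = "locf"), by = pid]
b[]
```

Note that fifelse() requires yes and no to have the same type, which is why NA_real_ is used for double columns; an integer column would need NA_integer_ instead.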
