data.table: using a list of columns as the condition

Question

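My code example is below; I think it may explain things better than I can. I understand why this doesn't work: R performs the boolean comparison on the column names rather than on the values in those columns. What I'm not sure about is how to make it work.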

library(data.table)

DT = data.table( a = 1:5,
                 b = 6:10,
                 a_valid = c(0,1,1,0,0),
                 b_valid = c(1,1,0,0,0)
)

# This works
DT[a_valid == 0, a := NA]

numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')

# This doesn't.
DT[binary_columns == 0, numeric_columns := NA]
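
To make the failure concrete, here is a small sketch of what the two comparisons evaluate to. The variable names `name_check` and `value_check` are only for illustration; the table is the one defined above.

```r
library(data.table)

DT <- data.table(a = 1:5,
                 b = 6:10,
                 a_valid = c(0, 1, 1, 0, 0),
                 b_valid = c(1, 1, 0, 0, 0))
binary_columns <- c('a_valid', 'b_valid')

# Comparing the character vector of names with 0 tests the names themselves:
# one result per name, not one per row.
name_check <- binary_columns == 0
print(name_check)   # FALSE FALSE

# Looking a column up by its name compares the values in that column instead.
value_check <- DT[[binary_columns[1]]] == 0
print(value_check)  # TRUE FALSE FALSE TRUE TRUE
```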

Answer 1

You can use a loop:

for (i in seq_along(numeric_columns)) {
  DT[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
}

Using set() should be slightly faster:

for (i in seq_along(numeric_columns)) {
  set(
    DT, 
    i = which(DT[[binary_columns[i]]] == 0), 
    j = numeric_columns[i], 
    value = NA_integer_
  )
}
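
As a rough sketch, the set() loop can be wrapped in a small function and checked against the table from the question. The name `mask_invalid` and its arguments are hypothetical, not part of data.table, and the sketch assumes the value columns are integer, as they are in the question.

```r
library(data.table)

# Hypothetical helper: for each pair (value_cols[k], flag_cols[k]),
# set the value column to NA in the rows where the flag column equals 0.
mask_invalid <- function(dt, value_cols, flag_cols) {
  stopifnot(length(value_cols) == length(flag_cols))
  for (k in seq_along(value_cols)) {
    rows <- which(dt[[flag_cols[k]]] == 0)
    set(dt, i = rows, j = value_cols[k], value = NA_integer_)
  }
  invisible(dt)  # dt was modified by reference; return it invisibly
}

DT <- data.table(a = 1:5,
                 b = 6:10,
                 a_valid = c(0, 1, 1, 0, 0),
                 b_valid = c(1, 1, 0, 0, 0))

mask_invalid(DT, c('a', 'b'), c('a_valid', 'b_valid'))
print(DT)
```

Because set() modifies the table by reference, there is no need to assign the result back to DT.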

Or switch to base R for a moment:

setDF(DT)
DT[numeric_columns][DT[binary_columns] == 0] <- NA
setDT(DT)

Answer 2

To @sindri_baldur's solution I would add the option of using lapply:

lapply(seq_along(numeric_columns), function(i) DT[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA])

This avoids the overhead of the for loop.

A few benchmarks can help in choosing the best solution:

library(data.table)

DT = data.table(a = 1:1e5,
                b = 1:1e5 + 1e5,
                a_valid = sample(c(0,1), size = 1e5, replace = TRUE),
                b_valid = sample(c(0,1), size = 1e5, replace = TRUE)
)
numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')

dt2 <- copy(DT)
dt3 <- copy(DT)
dt4 <- copy(DT)

microbenchmark::microbenchmark(
  for (i in seq_along(numeric_columns)) {
    dt2[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
  },
  lapply(seq_along(numeric_columns), function(i) dt3[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]),
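  for (i in seq_along(numeric_columns)) {
    set(
      dt4,
      i = which(dt4[[binary_columns[i]]] == 0),
      j = numeric_columns[i],
      value = NA_integer_
    )
  }
)

# Timings as reported in the original answer, one row per expression in the order above.
# The third row was measured with a set() loop that also computed an unused row index
# and iterated over j while indexing with i, so it likely overstates the cost of set().
#       min        lq      mean    median        uq       max neval
#  9.962940 10.104035 11.278033 10.226006 10.555132  22.10373   100
#  4.453995  4.535093  7.726525  4.659652  4.830672 234.04730   100
# 11.781060 11.913439 13.056660 12.021012 12.365140  26.84604   100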

In this case the winner is the lapply solution. If you need this for more than two columns, the set solution may be the better choice.
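
With more columns, typing both name vectors by hand gets tedious. The sketch below assumes that every flag column is named after its value column with a `_valid` suffix, as in the question; that naming convention, the third column `c`, and the names `flag_cols` and `value_cols` are assumptions made for illustration.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6, c = 7:9,
                 a_valid = c(0, 1, 1),
                 b_valid = c(1, 0, 1),
                 c_valid = c(1, 1, 0))

# Derive the column pairs from the assumed "<name>_valid" convention
flag_cols  <- grep("_valid$", names(DT), value = TRUE)
value_cols <- sub("_valid$", "", flag_cols)
stopifnot(all(value_cols %in% names(DT)))  # every flag needs a value column

# Same set() loop as above, now over however many pairs were found
for (k in seq_along(value_cols)) {
  set(DT,
      i = which(DT[[flag_cols[k]]] == 0),
      j = value_cols[k],
      value = NA_integer_)
}
print(DT)
```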


Answer 1
2 2020-04-09 17:47:58

Answer 2
1 2020-04-09 18:38:58
