Apply a function to every specified column in a data.table and update by reference

Question

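I have a data.table and I would like to perform the same operation on certain columns. The names of these columns are given in a character vector. In this particular example, I would like to multiply all of these columns by -1.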

Some toy data and a vector specifying the relevant columns:

library(data.table)
dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

Right now I'm doing it this way, looping over the character vector:

for (col in 1:length(cols)) {
   dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
}

Is there a way to do this directly, without the for loop?

Answer 1

This seems to work:

dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]

The result is

    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3

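There are a few tricks here:

Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
.SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).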
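
To make the first trick concrete, here is a minimal sketch that contrasts the two forms on the toy data from the question. The names dt1 and dt2 are copies introduced only for this illustration.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# Without parentheses the left-hand side is read as a literal name:
# a new column called "cols" is added, and a and b are left untouched.
dt1 <- copy(dt)
dt1[, cols := -1L]
names(dt1)

# With parentheses, cols is evaluated first, so columns a and b are overwritten.
dt2 <- copy(dt)
dt2[, (cols) := lapply(.SD, "*", -1), .SDcols = cols]
names(dt2)
```

copy() is used so that each variant starts from the same unmodified table, since := changes its input in place.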

EDIT: Here's another way that is probably faster, as @Arun mentioned:

for (j in cols) set(dt, j = j, value = -dt[[j]])
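
As a rough check that the set() loop really updates by reference, the sketch below compares the table's memory address before and after the loop using data.table's address() helper. The variable names addr_before and addr_after are made up for this example.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

addr_before <- address(dt)   # where the table object lives in memory

# set() assigns one column per call and avoids the overhead of `[.data.table`.
for (j in cols) set(dt, j = j, value = -dt[[j]])

addr_after <- address(dt)

# TRUE means the same object was modified rather than copied.
identical(addr_before, addr_after)
dt
```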

Answer 2

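I would like to add an answer for the case where you also want to change the names of the columns. This comes in handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.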

cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
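
A self-contained version of the same idea, run on the toy data from the question, might look like the sketch below. It assumes the natural logarithm is wanted, so plain log is passed instead of the wrapper function.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")
out_cols <- paste("log", cols, sep = ".")   # "log.a" "log.b"

# The left-hand side names the new columns; the right-hand side is
# computed from the columns listed in .SDcols, which stay unchanged.
dt[, (out_cols) := lapply(.SD, log), .SDcols = cols]

names(dt)
```

The original columns a and b are kept, and the two log columns are appended after d.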

Answer 3

UPDATE: The following is a neat way to do it without a for loop

dt[,(cols):= - dt[,..cols]]

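It is a neat way as far as code readability goes. As for performance, though, it falls behind Frank's solution according to the microbenchmark results below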

library(microbenchmark)
mbm = microbenchmark(
  base_solution = for (col in 1:length(cols)) {
    dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
  },
  franks_solution1 = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
  franks_solution2 =  for (j in cols) set(dt, j = j, value = -dt[[j]]),
  hannes_solution = dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols],
  orhans_solution = for (j in cols) dt[,(j):= -1 * dt[,  ..j]],
  orhans_solution2 = dt[,(cols):= - dt[,..cols]],
  times=1000
)
mbm

Unit: microseconds
expr                  min        lq      mean    median       uq       max neval
base_solution    3874.048 4184.4070 5205.8782 4452.5090 5127.586 69641.789  1000  
franks_solution1  313.846  349.1285  448.4770  379.8970  447.384  5654.149  1000    
franks_solution2 1500.306 1667.6910 2041.6134 1774.3580 1961.229  9723.070  1000    
hannes_solution   326.154  405.5385  561.8263  495.1795  576.000 12432.400  1000
orhans_solution  3747.690 4008.8175 5029.8333 4299.4840 4933.739 35025.202  1000  
orhans_solution2  752.000  831.5900 1061.6974  897.6405 1026.872  9913.018  1000

The original answer also showed these results as a chart.

My previous answer: the following also works

for (j in cols)
  dt[,(j):= -1 * dt[,  ..j]]

Answer 4

None of the above solutions seems to work with calculations by group. The following is the best I got:

for(col in cols)
{
    DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), g]
}
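
The snippet above relies on a table DT and a grouping column g that are not defined in the answer. A minimal runnable sketch of the same loop follows. The toy data is made up for this example, and group-wise demeaning is swapped in for scale() so the result is easy to check by hand. The columns are created as doubles on purpose, so the grouped assignment does not have to change the column type.

```r
library(data.table)

# Hypothetical toy data: two numeric columns and a grouping column g.
dt <- data.table(a = c(1, 2, 3, 4, 5, 6),
                 b = c(10, 20, 30, 40, 50, 60),
                 g = c(1, 1, 1, 2, 2, 2))
cols <- c("a", "b")

# For each listed column, subtract the group mean within each group of g.
for (col in cols) {
  dt[, (col) := .SD[[col]] - mean(.SD[[col]]), by = g]
}

dt
```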

Answer 5

Adding an example that creates new columns based on a character vector of column names. Based on Jfly's answer:

dt <- data.table(a = rnorm(1:100), b = rnorm(1:100), c = rnorm(1:100), g = c(rep(1:10, 10)))

col0 <- c("a", "b", "c")
col1 <- paste0("max.", col0)  

for(i in seq_along(col0)) {
  dt[, (col1[i]) := max(get(col0[i])), g]
}

dt[,.N, c("g", col1)]

Answer 6

library(data.table)
(dt <- data.table(a = 1:3, b = 1:3, d = 1:3))

Hence:

   a b d
1: 1 1 1
2: 2 2 2
3: 3 3 3

Whereas (dt*(-1)) yields:

    a  b  d
1: -1 -1 -1
2: -2 -2 -2
3: -3 -3 -3

Answer 7

dplyr functions work on data.tables, so here's a dplyr solution that also "avoids the for loop" :)

dt %>% mutate(across(all_of(cols), ~ -1 * .))

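I benchmarked it using Orhan's code (with added rows and columns), and you'll see that dplyr::mutate with across mostly executes faster than most of the other solutions, but slower than the data.table solution using lapply.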

library(data.table); library(dplyr)
dt <- data.table(a = 1:100000, b = 1:100000, d = 1:100000) %>% 
  mutate(a2 = a, a3 = a, a4 = a, a5 = a, a6 = a)
cols <- c("a", "b", "a2", "a3", "a4", "a5", "a6")

dt %>% mutate(across(all_of(cols), ~ -1 * .))
#>               a       b      d      a2      a3      a4      a5      a6
#>      1:      -1      -1      1      -1      -1      -1      -1      -1
#>      2:      -2      -2      2      -2      -2      -2      -2      -2
#>      3:      -3      -3      3      -3      -3      -3      -3      -3
#>      4:      -4      -4      4      -4      -4      -4      -4      -4
#>      5:      -5      -5      5      -5      -5      -5      -5      -5
#>     ---                                                               
#>  99996:  -99996  -99996  99996  -99996  -99996  -99996  -99996  -99996
#>  99997:  -99997  -99997  99997  -99997  -99997  -99997  -99997  -99997
#>  99998:  -99998  -99998  99998  -99998  -99998  -99998  -99998  -99998
#>  99999:  -99999  -99999  99999  -99999  -99999  -99999  -99999  -99999
#> 100000: -100000 -100000 100000 -100000 -100000 -100000 -100000 -100000

library(microbenchmark)
mbm = microbenchmark(
  base_with_forloop = for (col in 1:length(cols)) {
    dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
  },
  franks_soln1_w_lapply = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
  franks_soln2_w_forloop =  for (j in cols) set(dt, j = j, value = -dt[[j]]),
  orhans_soln_w_forloop = for (j in cols) dt[,(j):= -1 * dt[,  ..j]],
  orhans_soln2 = dt[,(cols):= - dt[,..cols]],
  dplyr_soln = (dt %>% mutate(across(all_of(cols), ~ -1 * .))),
  times=1000
)

library(ggplot2)
ggplot(mbm) +
  geom_violin(aes(x = expr, y = time)) +
  coord_flip()

Created on 2020-10-16 by the reprex package (v0.3.0)


7 answers

Answer 1: score 159, accepted, 2013-05-30 21:59:05
Answer 2: score 20, 2017-03-30 08:16:21
Answer 3: score 11, 2018-04-02 12:57:43
Answer 4: score 2, 2018-11-19 18:43:21
Answer 5: score 1, 2019-02-04 10:29:06
Answer 6: score 0, 2019-01-23 16:12:24
Answer 7: score 0, 2020-10-16 11:25:22
