What is the most elegant way to apply a function to multiple pairs of columns in a data.table or data.frame?

I often need to apply some function or operation to a pair of columns in a data.table or data.frame in wide format. For example, calculate the difference between a patient's weight before and after treatment.

Often there are multiple pairs of columns that need the same operation. For example, to calculate the differences in weight, BMI, blood pressure, leucocyte count, ... of a patient, each before and after treatment.

What is the least verbose way to do this in R, especially when using the data.table package? I found the following solutions to work, but they create overhead in real-world problems, where variable names do not follow a perfect pattern.

Consider the following minimal working example. The goal is to calculate the differences between a.1 and a.2, b.1 and b.2, c.1 and c.2, and to name them a.3, b.3, c.3. What I especially don't like is having to rename the columns "manually" at the end.

library(data.table)

prefixes <- c("a", "b", "c")

one.cols <- paste0(prefixes, ".1")
two.cols <- paste0(prefixes, ".2")
result.cols <- paste0(prefixes, ".3")

# Data usually read from file
DT <- data.table(id = LETTERS[1:5],
                 a.1 = 1:5,
                 b.1 = 11:15,
                 c.1 = 21:25,
                 a.2 = 6:10,
                 b.2 = 16:20,
                 c.2 = 26:30)

DT.res <- cbind(DT[,.(id)], 
      result = DT[,..one.cols] - DT[,..two.cols] 
      )

old <- grep(pattern = "result.*", x = colnames(DT.res), value = TRUE)

setnames(DT.res, old = old, new = result.cols)

DT <- DT[DT.res, on = "id"]

# Gives desired result:
print(DT)
#    id a.1 b.1 c.1 a.2 b.2 c.2 a.3 b.3 c.3
# 1:  A   1  11  21   6  16  26  -5  -5  -5
# 2:  B   2  12  22   7  17  27  -5  -5  -5
# 3:  C   3  13  23   8  18  28  -5  -5  -5
# 4:  D   4  14  24   9  19  29  -5  -5  -5
# 5:  E   5  15  25  10  20  30  -5  -5  -5

# Second approach: reset the data, reshape to long format, then subtract within each id
DT <- data.table(id = LETTERS[1:5],
                 a.1 = 1:5,
                 b.1 = 11:15,
                 c.1 = 21:25,
                 a.2 = 6:10,
                 b.2 = 16:20,
                 c.2 = 26:30)

DT.reshaped <- reshape(DT, direction = "long",
        varying = mapply(FUN = "c", one.cols, two.cols, SIMPLIFY = FALSE)
)

DT.reshaped <- 
  DT.reshaped[, lapply(.SD, 
                       function(x){ x[1] - x[2] }), 
              keyby = .(id), .SDcols = one.cols]

setnames(DT.reshaped, old = one.cols, new = result.cols)

DT <- DT[DT.reshaped, on = "id"]

# Gives desired result, too:    
print(DT)

 

I'd prefer to write something like the following to get the same result:

DT[, (result.cols) := ..one.cols - ..two.cols]

Is there a way to do something like this?

1) gv: Using gv from the collapse package, we could do this:

library(collapse)

DT[, (result.cols) := gv(.SD, one.cols) - gv(.SD, two.cols)]

2) gvr: Alternatively, we can use gvr, the regex variant of gv, to eliminate one.cols and two.cols:

library(collapse)

result.cols <- sub(1, 3, gvr(DT, "1$", "names"))
DT[, (result.cols) := gvr(.SD, "1$") - gvr(.SD, "2$")]

3) across: Using dplyr, we can use across, which eliminates result.cols as well:

library(dplyr)

DT %>%
  mutate(across(ends_with("1"), .names="{sub(1,3,.col)}") - across(ends_with("2")))

4) data.table: Written like this, it is straightforward in data.table:

DT[, result.cols] <- DT[, ..one.cols] - DT[, ..two.cols]

or

DT[, (result.cols) := .SD[, one.cols, with=FALSE] - .SD[, two.cols, with=FALSE]]
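
Since the question also mentions data.frame, the same idea can be sketched in base R without any package. This is only an illustration that assumes the column layout of the example above; the name df is introduced here and is not part of the original answer.

```r
# Same example data as above, but as a plain data.frame
df <- data.frame(id = LETTERS[1:5],
                 a.1 = 1:5, b.1 = 11:15, c.1 = 21:25,
                 a.2 = 6:10, b.2 = 16:20, c.2 = 26:30)

prefixes    <- c("a", "b", "c")
one.cols    <- paste0(prefixes, ".1")
two.cols    <- paste0(prefixes, ".2")
result.cols <- paste0(prefixes, ".3")

# Subtracting two data.frames works element-wise, column by column;
# assigning to new column names appends the results
df[result.cols] <- df[one.cols] - df[two.cols]
```

The columns are matched by position, so one.cols and two.cols must list the pairs in the same order.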

You can use mget and Map to do this:

DT[, (result.cols) := Map(`-`, mget(one.cols), mget(two.cols))]

DT
#    id a.1 b.1 c.1 a.2 b.2 c.2 a.3 b.3 c.3
# 1:  A   1  11  21   6  16  26  -5  -5  -5
# 2:  B   2  12  22   7  17  27  -5  -5  -5
# 3:  C   3  13  23   8  18  28  -5  -5  -5
# 4:  D   4  14  24   9  19  29  -5  -5  -5
# 5:  E   5  15  25  10  20  30  -5  -5  -5

But in general, you might want to consider keeping your data in long format for such computations and creating a separate column for the time (before / after treatment).
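
As a rough sketch of what the long-format suggestion could look like with data.table, assuming the names follow the prefix.number pattern of the example (the names long, measure, time and diffs are made up for this illustration):

```r
library(data.table)

DT <- data.table(id = LETTERS[1:5],
                 a.1 = 1:5, b.1 = 11:15, c.1 = 21:25,
                 a.2 = 6:10, b.2 = 16:20, c.2 = 26:30)

# Melt to long format: one row per id and original column
long <- melt(DT, id.vars = "id", variable.factor = FALSE)

# Split names such as "a.1" into the measure ("a") and the time point ("1")
long[, c("measure", "time") := tstrsplit(variable, ".", fixed = TRUE)]

# One difference per id and measure: time point 1 minus time point 2
diffs <- long[, .(diff = value[time == "1"] - value[time == "2"]),
              by = .(id, measure)]
```

In this layout a new measurement adds rows rather than columns, so the difference step does not need to change.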

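Another (highly flexible) approach is to split the DT data.table by a characteristic of its column names and then perform the subtraction on the elements of the resulting list. Note that the call below computes the .2 columns minus the .1 columns, so the signs are reversed relative to the a.1 - a.2 differences in the question.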

L <- split.default(DT, gsub( ".*\\.([0-9])", "\\1", names(DT) ) )
DT[, (result.cols) := L$`2` - L$`1`]
#    id a.1 b.1 c.1 a.2 b.2 c.2 a.3 b.3 c.3
# 1:  A   1  11  21   6  16  26   5   5   5
# 2:  B   2  12  22   7  17  27   5   5   5
# 3:  C   3  13  23   8  18  28   5   5   5
# 4:  D   4  14  24   9  19  29   5   5   5
# 5:  E   5  15  25  10  20  30   5   5   5
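
When the column names do not follow a regular pattern, which is the concern raised in the question, one option is to write the pairs down explicitly in a small lookup table. The following is a sketch only: the column names and the pairs table are hypothetical and chosen to be deliberately irregular.

```r
library(data.table)

# Hypothetical data with irregular column names
DT <- data.table(id = LETTERS[1:5],
                 weight_before = c(80, 75, 90, 68, 100),
                 weight_post   = c(78, 76, 85, 68, 95),
                 bmi_t0        = c(26, 24, 29, 22, 31),
                 bmi_t1        = c(25, 24, 27, 22, 30))

# Hypothetical lookup table: one row per pair, plus the name of the result column
pairs <- data.table(before = c("weight_before", "bmi_t0"),
                    after  = c("weight_post",   "bmi_t1"),
                    result = c("weight_diff",   "bmi_diff"))

# Select both sides by name, subtract column by column, assign by reference
before <- DT[, pairs$before, with = FALSE]
after  <- DT[, pairs$after,  with = FALSE]
DT[, (pairs$result) := before - after]
```

The lookup table keeps the pairing in one place, so the subtraction itself does not depend on any naming convention.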
