Apply a function across groups and columns in data.table and/or dplyr
I would like to combine two data.tables or data frames with unequal numbers of rows, where the number of rows of dt2 equals the number of groups in dt1. Here is a reproducible example:
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)
The real case involves a large number of columns, so I wish to refer to them with variables. Using either a loop or apply, I wish to add each row of dt2 to the rows comprising each group of dt1. Here is one of many attempts that fail:
for (ic in 1:3) {
c1 <- dt2[,(ic), with=FALSE]
c2 <- dt2[,(ic), with=FALSE]
dt1[,(ic) := .(c1 + c2[.G]), by = "groupVar"]
}
I am interested in how to do this kind of operation "by group and by column" in both data.table syntax and dplyr syntax. Doing it in place (as above) is not critical.
Desired result:
dt1 (or dt3) =
a b c groupVar
11 22 33 1
12 23 34 1
13 24 35 1
24 35 46 2
...
40 51 62 3
Assuming that the column names are consistent (e.g., you want a + a2, b + b2, etc.), here is a tidyverse solution that starts in a similar way to @dclarson's, then uses the bang-bang operator to select the columns to add up.
Is this what you are after?
library(tidyverse) # provides tibble, dplyr, purrr and rlang
## Create tibbles and join
dt1 <- tibble(groupVar,a,b,c)
dt2 <- tibble(groupVar = 1:3,a2,b2,c2)
dt3 <- inner_join(dt1,dt2, by = "groupVar")
## Define the column starters you are interested in
cols <- c("a","b","c")
## Or in case of many columns
cols <- colnames(dt1[-1])
## Create function to add columns with the same starting letters
add_cols <- function(col){
dt3 %>% select(starts_with(!!col)) %>%
transmute(!!(sym(col)) := !!(sym(col)) + !!(sym(paste0(col,"2"))))
}
## map the function and add groupVar
map_dfc(cols,add_cols) %>% mutate(groupVar = dt3$groupVar)
# A tibble: 10 x 4
a b c groupVar
<dbl> <dbl> <dbl> <dbl>
1 11 22 33 1
2 12 23 34 1
3 13 24 35 1
4 24 35 46 2
5 25 36 47 2
6 26 37 48 2
7 37 48 59 3
8 38 49 60 3
9 39 50 61 3
10 40 51 62 3
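As a possible alternative to mapping a helper function over the column names, the same pairwise addition can be sketched with across() and cur_column() from dplyr (version 1.0.0 or later). This is not part of the answer above; it assumes, as the answer does, that each partner column is named by appending "2". The object names joined and res are made up for illustration.

```r
library(dplyr)

# Same toy data as in the question (groupVar added to dt2 by hand)
dt1 <- tibble(groupVar = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
              a = 1:10, b = 2:11, c = 3:12)
dt2 <- tibble(groupVar = c(1, 2, 3),
              a2 = c(10, 20, 30), b2 = c(20, 30, 40), c2 = c(30, 40, 50))

cols <- c("a", "b", "c")
joined <- inner_join(dt1, dt2, by = "groupVar")

# For each column in cols, add the partner column whose name ends in "2".
# cur_column() returns the name of the column across() is currently processing.
res <- joined %>%
  mutate(across(all_of(cols), ~ .x + joined[[paste0(cur_column(), "2")]])) %>%
  select(all_of(cols), groupVar)
res
```

The partner column is looked up in the ungrouped joined table, so this sketch would need adjusting if the data were grouped before the mutate() call.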
The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.
Here are two approaches which should work for an arbitrary number of arbitrarily named pairs of columns: reshaping to long format combined with update joins, and a for loop which uses get() in an update join.
The information on corresponding columns can be provided in a look-up table or translation table:
library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))
lut
   vars1 vars2
1:     a    a2
2:     b    b2
3:     c    c2
In cases where column names are treated as data and the column data are of the same data type, my first approach is to reshape to long format.
# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
    groupVar  a  b  c
 1:        1 11 22 33
 2:        1 12 23 34
 3:        1 13 24 35
 4:        2 24 35 46
 5:        2 25 36 47
 6:        2 26 37 48
 7:        3 37 48 59
 8:        3 38 49 60
 9:        3 39 50 61
10:        3 40 51 62
On second thought, here is an approach which is similar to the for loop proposed by the OP and requires much less code:
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
for (iv in seq_along(vars1)) {
dt1[dt2, on = .(groupVar),
(vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}
dt1[]
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3
Note that dt1 is updated by reference, i.e., without copying.
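To illustrate what updating by reference implies, here is a minimal sketch on a made-up table (the names dt, alias and snapshot are hypothetical). It shows that a second name bound to the same data.table sees the change made by :=, whereas an explicit copy() does not.

```r
library(data.table)

dt <- data.table(x = 1:3)
alias <- dt           # no copy: both names refer to the same data.table
snapshot <- copy(dt)  # explicit deep copy

dt[, x := x * 10L]    # modifies dt in place

alias$x     # the alias sees the updated values
snapshot$x  # the copy keeps the original values
```

If the original dt1 is still needed after the loop, working on copy(dt1) is one way to keep it untouched.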
Prepending the variable names vars1[iv] by "x." and vars2[iv] by "i." on the right hand side of := is to ensure that the right columns from dt1 and dt2, resp., are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").
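The effect of the prefixes is easiest to see when both tables share a column name. The following sketch uses two made-up tables, left and right (hypothetical names), which both contain a column val:

```r
library(data.table)

left  <- data.table(id = c(1, 1, 2), val = c(1, 2, 3))
right <- data.table(id = c(1, 2),    val = c(100, 200))

# Both tables have a column named val.
# x.val refers to left (the table being updated),
# i.val refers to right (the table joined in).
left[right, on = .(id), val := x.val + i.val]
left[]
```

Without the prefixes, a bare val on the right hand side would be ambiguous to the reader, which is why the loop above spells out both sides explicitly.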
This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.
library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
EVAL()
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3
It starts with a look-up table which is created on the fly and subsequently manipulated to form a complete data.table statement
dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]
as a character string. This string is then evaluated and executed in one go; no for loops required.
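For readers who prefer not to depend on glue and magrittr, the same idea can be sketched with base R string functions only. This is an illustrative variant, not the code from the answer: the names assignments and stmt are made up, the columns of dt1 are created as doubles to avoid type coercion in the update, and groupVar is added to dt2 up front.

```r
library(data.table)

dt1 <- data.table(a = as.numeric(1:10), b = as.numeric(2:11), c = as.numeric(3:12),
                  groupVar = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
dt2 <- data.table(a2 = c(10, 20, 30), b2 = c(20, 30, 40), c2 = c(30, 40, 50),
                  groupVar = c(1, 2, 3))
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")

# Build the assignment list "a = x.a + i.a2, b = ..." as text
assignments <- paste0(vars1, " = x.", vars1, " + i.", vars2, collapse = ", ")

# Wrap it into a complete update-join statement
stmt <- paste0("dt1[dt2, on = .(groupVar), `:=`(", assignments, ")]")
print(stmt)

# Parse the text into an expression and evaluate it; dt1 is updated by reference
eval(parse(text = stmt))
dt1[]
```

Printing stmt before evaluating it is a cheap way to check that the generated statement is the one intended.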
As the helper function EVAL() already uses paste0(), the call to glue() can be omitted:
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}
Note that the dot . and the curly brackets {} are used with different meanings in different contexts, which may appear somewhat confusing.
It is simple if you add groupVar to dt2:
dt2 <- data.table(a2, b2, c2, groupVar=1:3)
dt3 <- merge(dt1, dt2)
dt4 <- with(dt3, data.table(a=a+a2, b=b+b2, c=c+c2, groupVar))
dt4
# a b c groupVar
# 1: 11 22 33 1
# 2: 12 23 34 1
# 3: 13 24 35 1
# 4: 24 35 46 2
# 5: 25 36 47 2
# 6: 26 37 48 2
# 7: 37 48 59 3
# 8: 38 49 60 3
# 9: 39 50 61 3
# 10: 40 51 62 3
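Since the question mentions a large number of columns, typing out a=a+a2, b=b+b2, ... may not scale. One possible way to generalize the merge approach is sketched below, assuming the paired column names are held in two character vectors vars1 and vars2 (names borrowed from another answer) and using base R's Map() to add the pairs. The columns of dt1 are created as doubles here for simplicity.

```r
library(data.table)

dt1 <- data.table(a = as.numeric(1:10), b = as.numeric(2:11), c = as.numeric(3:12),
                  groupVar = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
dt2 <- data.table(a2 = c(10, 20, 30), b2 = c(20, 30, 40), c2 = c(30, 40, 50),
                  groupVar = c(1, 2, 3))
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")

dt3 <- merge(dt1, dt2, by = "groupVar")

# Add each pair of columns element-wise.
# Map() returns a list named after the columns of its first argument (vars1).
sums <- Map(`+`, dt3[, vars1, with = FALSE], dt3[, vars2, with = FALSE])

# Assemble the result and carry over the grouping column
dt4 <- as.data.table(sums)[, groupVar := dt3$groupVar]
dt4[]
```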
This should do what you want:
1. Create groupVar in dt2 with unique groupVar from dt1
2. right_join by groupVar
3. Create new columns a, b, c with mutate
4. Select a, b, c and groupVar as desired output
library(dplyr)
dt3 <- dt2 %>%
mutate(groupVar = unique(dt1$groupVar)) %>%
right_join(dt1, by="groupVar") %>%
mutate(a = a + a2,
b = b + b2,
c = c + c2) %>%
select(a, b, c, groupVar)
data:
library(data.table)
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)