I would like to combine two data.tables or dataframes of unequal row #, where the # of rows of dt2 is the same as the number of groups of dt1. Here is a reproducible example:
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)
The real case involves a large number of columns so I with to refer to them with variables. Using either a loop or apply, I wish to add each row of dt2 to the rows comprising each group of dt1. Here is one of many attempts that fail:
for (ic in 1:3) {
c1 <- dt2[,(ic), with=FALSE]
c2 <- dt2[,(ic), with=FALSE]
dt1[,(ic) := .(c1 + c2[.G]), by = "groupVar"]
}
I am interested in how to do this kind of operation "by group and by column" in both data.table syntax and dplyr syntax. In place (as above) is not critical.
desired result:
dt1 (or dt3) =
a b c groupVar
11 22 33 1
12 23 34 1
13 24 35 1
24 35 46 2
...
40 51 62 3
Assuming that the column names are consistent (eg you want a + a2, b + b2...etc), here is a tidyverse solution that starts in a similar way as @dclarson's, then uses the bang-bang operator to select the columns to add up.
Is this what you are after?
## Create tibbles and join
dt1 <- tibble(groupVar,a,b,c)
dt2 <- tibble(groupVar = 1:3,a2,b2,c2)
dt3 <- inner_join(dt1,dt2)
## Define the column starters you are interested in
cols <- c("a","b","c")
## Or in case of many columns
cols <- colnames(dt1[-1])
## Create function to add columns with the same starting letters
add_cols <- function(col){
dt3 %>% select(starts_with(!!col)) %>%
transmute(!!(sym(col)) := !!(sym(col)) + !!(sym(paste0(col,"2"))))
}
## map the function and add groupVar
map_dfc(cols,add_cols) %>% mutate(groupVar = dt3$groupVar)
# A tibble: 10 x 4
a b c groupVar
<dbl> <dbl> <dbl> <dbl>
1 11 22 33 1
2 12 23 34 1
3 13 24 35 1
4 24 35 46 2
5 25 36 47 2
6 26 37 48 2
7 37 48 59 3
8 38 49 60 3
9 39 50 61 3
10 40 51 62 3
The sample datasets provided with the question indicate that the names of the columns may differ between datasets, eg, column b
of dt1
and column b2
of dt2
are supposed to be added.
Here are two approaches which should be working for an arbitrary number of arbitrarily named pairs of columns:
get()
The information on corresponding columns can be provided in a look-up table or translation table :
library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))
lut
vars1 vars2 1: a a2 2: b b2 3: c c2
In cases where column names are treated as data and the column data are of the same data type my first approach is to reshape to long format.
# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape backe to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
groupVar ab c 1: 1 11 22 33 2: 1 12 23 34 3: 1 13 24 35 4: 2 24 35 46 5: 2 25 36 47 6: 2 26 37 48 7: 3 37 48 59 8: 3 38 49 60 9: 3 39 50 61 10: 3 40 51 62
get()
Giving a second thought, here is an approach which is similar to OP's proposed for
loop and requires much less coding:
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
for (iv in seq_along(vars1)) {
dt1[dt2, on = .(groupVar),
(vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}
dt1[]
ab c groupVar 1: 11 22 33 1 2: 12 23 34 1 3: 13 24 35 1 4: 24 35 46 2 5: 25 36 47 2 6: 26 37 48 2 7: 37 48 59 3 8: 38 49 60 3 9: 39 50 61 3 10: 40 51 62 3
Note that dt1
is updated by reference , ie, without copying.
Prepending the variable names vars1[iv]
by "x."
and vars2[iv]
by "i."
on the right hand side of :=
is to ensure that the right columns from dt1
and dt2
, resp., are picked in case of duplicated column names. See the Advanced: section on the j
parameter in help("data.table")
.
This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.
library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
EVAL()
ab c groupVar 1: 11 22 33 1 2: 12 23 34 1 3: 13 24 35 1 4: 24 35 46 2 5: 25 36 47 2 6: 26 37 48 2 7: 37 48 59 3 8: 38 49 60 3 9: 39 50 61 3 10: 40 51 62 3
It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement
dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]
as a character string. This string is then evaluated and executed in one go; no for
loops required.
As the helper function EVAL()
already uses paste0()
the call to glue()
can be omitted:
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}
Note that dot .
and curly brackets {}
are used with different meaning in different contexts which may appear somewhat confusing.
It is simple if you add groupVar
to dt2:
dt2 <- data.table(a2, b2, c2, groupVar=1:3)
dt3 <- merge(dt1, dt2)
dt4 <- with(dt3, data.table(a=a+a2, b=b+b2, c=c+c2, groupVar))
dt4
# a b c groupVar
# 1: 11 22 33 1
# 2: 12 23 34 1
# 3: 13 24 35 1
# 4: 24 35 46 2
# 5: 25 36 47 2
# 6: 26 37 48 2
# 7: 37 48 59 3
# 8: 38 49 60 3
# 9: 39 50 61 3
# 10: 40 51 62 3
This should solve your desire:
groupVar
in dt2
with unique
groupVar
from dt1
right_join
by groupVar
a
, b
, c
with mutate
a
, b
, c
and groupVar
as desired outputlibrary(dplyr)
dt3 <- dt2 %>%
mutate(groupVar = unique(dt1$groupVar)) %>%
right_join(dt1, by="groupVar") %>%
mutate(a = a + a2,
b = b + b2,
c = c + c2) %>%
select(a, b, c, groupVar)
data:
library(data.table)
a <- 1:10; b <- 2:11; c <- 3:12
groupVar <- c(1,1,1,2,2,2,3,3,3,3)
dt1 <- data.table(a,b,c,groupVar)
a2 <- c(10,20,30); b2 <- c(20,30,40); c2 <- c(30,40,50)
dt2 <- data.table(a2,b2,c2)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.