
Efficiently joining more than 2 data.tables

I was wondering if there is a memory efficient way to join n data.tables (or data frames). For example, if I have the following 4 data.tables:

df1 = data.table(group = c(1L,2L,3L),value = rnorm(3),key = "group")
df2 = data.table(group = c(2L,1L,3L),value2 = rnorm(3),key = "group")
df3 = data.table(group = c(3L,2L,1L),value3 = rnorm(3),key = "group")
df4 = data.table(group = c(1L,3L,2L),value4 = rnorm(3),key = "group")

I could merge them like so:

merge(df1,merge(df2,merge(df3,df4)))

but that does not seem like an optimal solution. I could potentially have many data.tables that need to be merged. Is there a way to generalize the above without copying each successive merge to memory? Is there an already-accepted way outside of data.table to do this?

Here are some other options you may have, depending on your data; other options, that is, apart from the obvious path of doing a ton of merges: in a loop, with Reduce, or with Hadley's join_all / merge_all / wrap_em_all_up.
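
For illustration, the Reduce route folds merge over a list of tables. This is just a sketch of that generic pattern, assuming all the tables are keyed on the same column so that merge picks the join columns automatically:

library(data.table)

DFlist = list(df1, df2, df3, df4)
# fold merge() pairwise over the list; each step still allocates an
# intermediate result, so this is convenient rather than memory-efficient
DFmerged = Reduce(merge, DFlist)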

These are all methods that I have used and found to be faster in my own work, but I don't intend to attempt a general benchmarking case. First, some setup:

DFlist = list(df1,df2,df3,df4)
bycols = key(DFlist[[1]])

I'll assume the tables are all keyed by the bycols.
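
(If they weren't, you could key them in place first; setkeyv works by reference, so a plain loop is enough. Here "group" is the key column from the example data:)

for (DT in DFlist) setkeyv(DT, "group")  # sorts and keys each table in place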

Stack. If the new cols from each table are somehow related to each other and appear in the same positions in every table, then consider just stacking the data:

# use.names = FALSE stacks columns by position; idcol = TRUE records
# which table each row came from in a new ".id" column
DFlong = rbindlist(DFlist, use.names = FALSE, idcol = TRUE)

If for some reason you really want the data in wide format, you can dcast:

dcast(DFlong, 
  formula = sprintf("%s ~ .id", paste(bycols, collapse = "+")), 
  value.var = setdiff(names(DFlong), c(bycols, ".id"))
)
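
With the four example tables above, bycols is just "group", and (because use.names = FALSE stacked by position) the single value column keeps the first table's name, so the call boils down to:

dcast(DFlong, group ~ .id, value.var = "value")
# one row per group; columns "1".."4", one per source table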

Data.table and R work best with long-format data, though.

Copy cols. If you know that the bycols take all the same values in all of the tables, then just copy over:

# start from the key columns of the first table
DF = DFlist[[1]][, bycols, with=FALSE]
for (k in seq_along(DFlist)){
  # copy each table's non-key columns straight across, no join needed
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[, (newcols) := DFlist[[k]][, newcols, with=FALSE]]
}
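
This copies by position with no join at all, so it silently relies on every table having its rows in the same key order; keyed tables are kept sorted, so that holds here, but a cheap assertion (my addition, not part of the pattern itself) doesn't hurt:

# sanity check: the key columns of every table must line up row-for-row
ref = DFlist[[1]][, bycols, with = FALSE]
for (DT in DFlist) stopifnot(all(DT[, bycols, with = FALSE] == ref))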

Merge assign. If some levels of bycols may be missing from certain tables, then make a master table with all combos and do a sequence of merge-assigns:

# master table with every key combination seen in any of the tables
DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))
setkeyv(DF, bycols)  # rbindlist drops the key; re-key so the joins below work
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  # join on the key and assign each table's new columns in place
  DF[DFlist[[k]], (newcols) := mget(newcols)]
}
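
Wrapped up as a function (a hypothetical helper, not from the original answer), the merge-assign pattern generalizes to any number of tables:

merge_assign = function(DFlist, bycols) {
  DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))
  setkeyv(DF, bycols)
  for (k in seq_along(DFlist)) {
    newcols = setdiff(names(DFlist[[k]]), bycols)
    DF[DFlist[[k]], (newcols) := mget(newcols)]
  }
  DF[]  # [] so the result prints on first use
}

DF = merge_assign(DFlist, key(DFlist[[1]]))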

In dplyr:

As your tables all share the same key column, you can just bind the rows and summarise, letting na.omit scrub out the NAs that bind_rows fills in for each table's missing columns.

library(dplyr)

# bind_rows pads the non-shared columns with NA; per group, each value
# column then has exactly one non-missing entry, which na.omit keeps
DF <- bind_rows(df1,df2,df3,df4) %>%
    group_by(group) %>%
    summarise_each(funs(na.omit))
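
(summarise_each and funs have since been deprecated; on dplyr >= 1.0 the equivalent, as far as I know, is across:)

DF <- bind_rows(df1, df2, df3, df4) %>%
    group_by(group) %>%
    summarise(across(everything(), ~ .x[!is.na(.x)]))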

Otherwise there is the simple, local-minimum solution: though at least coding in this dialect saves you shaving a few layers off your own onion of nested merge() calls.

DF <- 
    df1 %>% 
    full_join(df2) %>% 
    full_join(df3) %>% 
    full_join(df4) 
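
To generalize the chain to n tables, the same fold idea applies; a sketch with purrr, assuming every table shares the group column:

library(purrr)

DF <- reduce(list(df1, df2, df3, df4), full_join, by = "group")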

As dplyr's joins run in C++ rather than in R itself, they should be fast. I unfortunately am unable to speak to the efficiency of the memory usage.

(For similar situations, see the dplyr solution to "R: Updating a data frame with another data frame".)
