
Efficiently joining more than 2 data.tables

I was wondering if there is a memory efficient way to join n data.tables (or data frames). For example, if I have the following 4 data.tables:

df1 = data.table(group = c(1L,2L,3L),value = rnorm(3),key = "group")
df2 = data.table(group = c(2L,1L,3L),value2 = rnorm(3),key = "group")
df3 = data.table(group = c(3L,2L,1L),value3 = rnorm(3),key = "group")
df4 = data.table(group = c(1L,3L,2L),value4 = rnorm(3),key = "group")

I could merge them like so:

merge(df1,merge(df2,merge(df3,df4)))

but that does not seem like an optimal solution. I could potentially have many data.tables that need to be merged. Is there a way to generalize the above without copying each successive merge to memory? Is there an already-accepted way outside of data.table to do this?

Here are some other options you may have, depending on your data; other options, that is, apart from the obvious path of doing a ton of merges: in a loop, with Reduce, or with Hadley's join_all / merge_all / wrap_em_all_up.
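
For illustration, the Reduce route folds merge over a list of tables. This is just a sketch of that generic pattern, assuming all the tables are keyed on the same column so that merge picks the join columns automatically:

library(data.table)

DFlist = list(df1, df2, df3, df4)
# fold merge() pairwise over the list; each step still allocates an
# intermediate result, so this is convenient rather than memory-efficient
DFmerged = Reduce(merge, DFlist)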

These are all methods that I have used and found to be faster in my own work, but I don't intend to attempt a general benchmarking case. First, some setup:

DFlist = list(df1,df2,df3,df4)
bycols = key(DFlist[[1]])

I'll assume the tables are all keyed by the bycols.
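
(If they weren't, you could key them in place first; setkeyv works by reference, so a plain loop is enough. Here "group" is the key column from the example data:)

for (DT in DFlist) setkeyv(DT, "group")  # sorts and keys each table in place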

Stack. If the new cols from each table are somehow related to each other and appear in the same positions in every table, then consider just stacking the data:

# use.names = FALSE stacks columns by position; idcol = TRUE records
# which table each row came from in a new ".id" column
DFlong = rbindlist(DFlist, use.names = FALSE, idcol = TRUE)

If for some reason you really want the data in wide format, you can dcast:

dcast(DFlong, 
  formula = sprintf("%s ~ .id", paste(bycols, collapse = "+")), 
  value.var = setdiff(names(DFlong), c(bycols, ".id"))
)
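
With the four example tables above, bycols is just "group", and (because use.names = FALSE stacked by position) the single value column keeps the first table's name, so the call boils down to:

dcast(DFlong, group ~ .id, value.var = "value")
# one row per group; columns "1".."4", one per source table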

Data.table and R work best with long-format data, though.

Copy cols. If you know that the bycols take all the same values in all of the tables, then just copy over:

# start from the key columns of the first table
DF = DFlist[[1]][, bycols, with=FALSE]
for (k in seq_along(DFlist)){
  # copy each table's non-key columns straight across, no join needed
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[, (newcols) := DFlist[[k]][, newcols, with=FALSE]]
}
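
This copies by position with no join at all, so it silently relies on every table having its rows in the same key order; keyed tables are kept sorted, so that holds here, but a cheap assertion (my addition, not part of the pattern itself) doesn't hurt:

# sanity check: the key columns of every table must line up row-for-row
ref = DFlist[[1]][, bycols, with = FALSE]
for (DT in DFlist) stopifnot(all(DT[, bycols, with = FALSE] == ref))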

Merge assign. If some levels of bycols may be missing from certain tables, then make a master table with all combos and do a sequence of merge-assigns:

# master table with every key combination seen in any of the tables
DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))
setkeyv(DF, bycols)  # rbindlist drops the key; re-key so the joins below work
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  # join on the key and assign each table's new columns in place
  DF[DFlist[[k]], (newcols) := mget(newcols)]
}
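
Wrapped up as a function (a hypothetical helper, not from the original answer), the merge-assign pattern generalizes to any number of tables:

merge_assign = function(DFlist, bycols) {
  DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))
  setkeyv(DF, bycols)
  for (k in seq_along(DFlist)) {
    newcols = setdiff(names(DFlist[[k]]), bycols)
    DF[DFlist[[k]], (newcols) := mget(newcols)]
  }
  DF[]  # [] so the result prints on first use
}

DF = merge_assign(DFlist, key(DFlist[[1]]))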

In dplyr:

As your tables all share the same key column, you can just bind the rows and summarise, letting na.omit scrub out the NAs that bind_rows fills in for each table's missing columns.

library(dplyr)

# bind_rows pads the non-shared columns with NA; per group, each value
# column then has exactly one non-missing entry, which na.omit keeps
DF <- bind_rows(df1,df2,df3,df4) %>%
    group_by(group) %>%
    summarise_each(funs(na.omit))
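
(summarise_each and funs have since been deprecated; on dplyr >= 1.0 the equivalent, as far as I know, is across:)

DF <- bind_rows(df1, df2, df3, df4) %>%
    group_by(group) %>%
    summarise(across(everything(), ~ .x[!is.na(.x)]))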

Otherwise there is the simple, local-minimum solution: though at least coding in this dialect saves you shaving a few layers off your own onion of nested merge() calls.

DF <- 
    df1 %>% 
    full_join(df2) %>% 
    full_join(df3) %>% 
    full_join(df4) 
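
To generalize the chain to n tables, the same fold idea applies; a sketch with purrr, assuming every table shares the group column:

library(purrr)

DF <- reduce(list(df1, df2, df3, df4), full_join, by = "group")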

As dplyr's joins run in C++ rather than in R itself, they should be fast. I unfortunately am unable to speak to the efficiency of the memory usage.

(For similar situations, see the dplyr solution to "R: Updating a data frame with another data frame".)
