
Efficiently joining more than 2 data.tables

I was wondering if there is a memory-efficient way to join n data.tables (or data frames). For example, if I have the following 4 data.tables:

df1 = data.table(group = c(1L,2L,3L),value = rnorm(3),key = "group")
df2 = data.table(group = c(2L,1L,3L),value2 = rnorm(3),key = "group")
df3 = data.table(group = c(3L,2L,1L),value3 = rnorm(3),key = "group")
df4 = data.table(group = c(1L,3L,2L),value4 = rnorm(3),key = "group")

I could merge them like so:

merge(df1,merge(df2,merge(df3,df4)))

but that does not seem like an optimal solution. I might potentially have many data.tables that need to be merged. Is there a way to generalize the above without copying each successive merge into memory? Is there an already accepted way outside of data.table to do this?

Here are some other options you may have, depending on your data. By "other options" I mean options apart from the obvious path of doing a ton of merges: in a loop, with Reduce, or with hadley's join_all / merge_all / wrap_em_all_up.
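For reference, that "obvious path" can be sketched with base R's Reduce, which folds merge() over the list of tables and so generalizes merge(df1, merge(df2, ...)) to n tables (note that each step still materializes an intermediate copy):

```r
library(data.table)

df1 = data.table(group = c(1L,2L,3L), value  = rnorm(3), key = "group")
df2 = data.table(group = c(2L,1L,3L), value2 = rnorm(3), key = "group")
df3 = data.table(group = c(3L,2L,1L), value3 = rnorm(3), key = "group")
df4 = data.table(group = c(1L,3L,2L), value4 = rnorm(3), key = "group")

# Fold merge() over the list; by = "group" is explicit here for clarity,
# though merge.data.table would also pick it up from the shared key.
DF = Reduce(function(x, y) merge(x, y, by = "group"), list(df1, df2, df3, df4))
```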

These are all methods that I have used and found to be faster in my own work, but I don't intend to attempt a general benchmarking case. First, some setup:

DFlist = list(df1,df2,df3,df4)
bycols = key(DFlist[[1]])

I'll assume the tables are all keyed by the bycols.

Stack. If the new cols from each table are somehow related to each other and appear in the same positions in every table, then consider just stacking the data:

DFlong = rbindlist(DFlist, use.names = FALSE, idcol = TRUE)

If for some reason you really want the data in wide format, you can dcast:

dcast(DFlong, 
  formula = sprintf("%s ~ .id", paste(bycols, collapse = "+")), 
  value.var = setdiff(names(DFlong), c(bycols, ".id"))
)

Data.table and R work best with long-format data, though.

Copy cols. If you know that the bycols take all the same values in all of the tables, then just copy over:

DF = DFlist[[1]][, bycols, with=FALSE]
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[, (newcols) := DFlist[[k]][, newcols, with=FALSE]]
}

Merge assign. If some levels of bycols may be missing from certain tables, then make a master table with all combos and do a sequence of merge-assigns:

DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[DFlist[[k]], (newcols) := mget(newcols)]
}

In dplyr:

As your trials all have the same names (and you have scrubbed out the NAs), you can just bind the rows and summarise:

library(dplyr)

DF <- bind_rows(df1,df2,df3,df4) %>%
    group_by(group) %>%
    summarise_each(funs(na.omit))

Otherwise there is the simple, local-minimum solution, though at least coding in this dialect saves shaving a few layers off your own onion:

DF <- 
    df1 %>% 
    full_join(df2) %>% 
    full_join(df3) %>% 
    full_join(df4) 
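This chain also generalizes to n data frames with Reduce, a sketch assuming all frames share the join column (dplyr's full_join infers the join column from the common names when by is not given):

```r
library(dplyr)

df1 = data.frame(group = c(1L,2L,3L), value  = rnorm(3))
df2 = data.frame(group = c(2L,1L,3L), value2 = rnorm(3))
df3 = data.frame(group = c(3L,2L,1L), value3 = rnorm(3))
df4 = data.frame(group = c(1L,3L,2L), value4 = rnorm(3))

# Fold full_join over the list; equivalent to the explicit chain above
# but works for an arbitrary number of frames.
DF <- Reduce(full_join, list(df1, df2, df3, df4))
```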

As dplyr runs in C++, not S, it should be faster. I unfortunately am unable to speak to its memory efficiency.

(For similar situations see: R: Updating a data frame with another data frame's dplyr solution.)
