
Efficiently joining more than 2 data.tables

I was wondering if there is a memory-efficient way to join n data.tables (or data frames). For example, if I have the following 4 data.tables:

df1 = data.table(group = c(1L,2L,3L),value = rnorm(3),key = "group")
df2 = data.table(group = c(2L,1L,3L),value2 = rnorm(3),key = "group")
df3 = data.table(group = c(3L,2L,1L),value3 = rnorm(3),key = "group")
df4 = data.table(group = c(1L,3L,2L),value4 = rnorm(3),key = "group")

I could merge them like so:

merge(df1,merge(df2,merge(df3,df4)))

but that does not seem like an optimal solution. I might potentially have many data.tables that need to be merged. Is there a way to generalize the above without copying each successive merge into memory? Is there an already accepted way outside of data.table to do this?

Here are some other options you may have, depending on your data. By "other options" I mean options apart from the obvious path of doing a ton of merges: in a loop, with Reduce, or with hadley's join_all / merge_all / wrap_em_all_up.
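For reference, that "obvious path" can be sketched with base R's Reduce, which folds merge() over the list of tables and so generalizes merge(df1, merge(df2, ...)) to n tables (note that each step still materializes an intermediate copy):

```r
library(data.table)

df1 = data.table(group = c(1L,2L,3L), value  = rnorm(3), key = "group")
df2 = data.table(group = c(2L,1L,3L), value2 = rnorm(3), key = "group")
df3 = data.table(group = c(3L,2L,1L), value3 = rnorm(3), key = "group")
df4 = data.table(group = c(1L,3L,2L), value4 = rnorm(3), key = "group")

# Fold merge() over the list; by = "group" is explicit here for clarity,
# though merge.data.table would also pick it up from the shared key.
DF = Reduce(function(x, y) merge(x, y, by = "group"), list(df1, df2, df3, df4))
```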

These are all methods that I have used and found to be faster in my own work, but I don't intend to attempt a general benchmarking case. First, some setup:

DFlist = list(df1,df2,df3,df4)
bycols = key(DFlist[[1]])

I'll assume the tables are all keyed by the bycols.

Stack. If the new cols from each table are somehow related to each other and appear in the same positions in every table, then consider just stacking the data:

DFlong = rbindlist(DFlist, use.names = FALSE, idcol = TRUE)

If for some reason you really want the data in wide format, you can dcast:

dcast(DFlong, 
  formula = sprintf("%s ~ .id", paste(bycols, collapse = "+")), 
  value.var = setdiff(names(DFlong), c(bycols, ".id"))
)

Data.table and R work best with long-format data, though.

Copy cols. If you know that the bycols take all the same values in all of the tables, then just copy over:

DF = DFlist[[1]][, bycols, with=FALSE]
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[, (newcols) := DFlist[[k]][, newcols, with=FALSE]]
}

Merge assign. If some levels of bycols may be missing from certain tables, then make a master table with all combos and do a sequence of merge-assigns:

DF = unique(rbindlist(lapply(DFlist, `[`, j = bycols, with = FALSE)))
for (k in seq_along(DFlist)){
  newcols = setdiff(names(DFlist[[k]]), bycols)
  DF[DFlist[[k]], (newcols) := mget(newcols)]
}

In dplyr:

As your trials all have the same names (and you have scrubbed out the NAs), you can just bind the rows and summarise:

library(dplyr)

DF <- bind_rows(df1,df2,df3,df4) %>%
    group_by(group) %>%
    summarise_each(funs(na.omit))

Otherwise there is the simple, local-minimum solution, though at least coding in this dialect saves shaving a few layers off your own onion:

DF <- 
    df1 %>% 
    full_join(df2) %>% 
    full_join(df3) %>% 
    full_join(df4) 
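This chain also generalizes to n data frames with Reduce, a sketch assuming all frames share the join column (dplyr's full_join infers the join column from the common names when by is not given):

```r
library(dplyr)

df1 = data.frame(group = c(1L,2L,3L), value  = rnorm(3))
df2 = data.frame(group = c(2L,1L,3L), value2 = rnorm(3))
df3 = data.frame(group = c(3L,2L,1L), value3 = rnorm(3))
df4 = data.frame(group = c(1L,3L,2L), value4 = rnorm(3))

# Fold full_join over the list; equivalent to the explicit chain above
# but works for an arbitrary number of frames.
DF <- Reduce(full_join, list(df1, df2, df3, df4))
```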

As dplyr runs in C++, not S, it should be faster. I unfortunately am unable to speak to its memory efficiency.

(For similar situations see: R: Updating a data frame with another data frame's dplyr solution.)
