
Quickly merging two data.tables - parallelization or data.table

I'm trying to merge two data.tables, both of which are around 60-80 million rows long. I know that data.table is built to be very efficient at merging, but I'm wondering whether, for data of this size, data.table is still more efficient than potentially parallelizing the merge, especially since I have access to a computing cluster.

This is what I'm currently doing.

# sort both tables by their join columns
setorder(fcc_temp, BlockCode)
setorder(block_data_long, block_fips)
# update join: add pop and tract from block_data_long to fcc_temp,
# matching on BlockCode = block_fips and year (i. = columns from the joined table)
fcc_temp[block_data_long, c("pop", "tract") := list(i.pop, i.tract),
         on = c(BlockCode = "block_fips", year = "year")]

Your example doesn't show many details, such as the data types of the join columns.

data.table's join is currently single-threaded. Some small parts of it use multiple cores, but AFAIR that is only for finding the order of the join columns; computing the matches runs in a single thread.
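A minimal sketch of checking and controlling the threads data.table uses for the parts that are multi-threaded, assuming a reasonably recent data.table version (getDTthreads() and setDTthreads() are its thread-control functions):

library(data.table)

# report the current thread settings data.table will use for its
# multi-threaded internals (the match-finding step of a join itself
# still runs on one thread)
getDTthreads(verbose = TRUE)

# 0 = use all logical CPUs available
setDTthreads(0)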

Keep in mind that parallelizing a join is non-trivial and will not scale as well as many other operations, so the potential gains are much smaller than they are for, say, grouping.
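For illustration, here is a rough sketch of what manually parallelizing the join might look like: split both tables into the same ranges of the join key, join each pair of chunks on a separate core, and bind the results. The toy data sizes, chunk count, and column names are assumptions for the sketch, not part of the question.

library(data.table)
library(parallel)

# toy stand-ins for the real 60-80M row tables
n   <- 1e6
lhs <- data.table(BlockCode  = sample(n), year = 2019L, x = rnorm(n))
rhs <- data.table(block_fips = sample(n), year = 2019L,
                  pop = rpois(n, 100), tract = sample(1e4, n, TRUE))

# split both sides into the same key ranges and join each pair of chunks
breaks <- seq(0, n, length.out = 5)
chunk_join <- function(k) {
  l <- lhs[BlockCode  > breaks[k] & BlockCode  <= breaks[k + 1]]
  r <- rhs[block_fips > breaks[k] & block_fips <= breaks[k + 1]]
  l[r, on = c(BlockCode = "block_fips", year = "year"), nomatch = NULL]
}

# mclapply forks, so the chunks run on separate cores (Unix-like systems;
# on Windows use mc.cores = 1 or parLapply instead)
res <- rbindlist(mclapply(seq_len(length(breaks) - 1), chunk_join, mc.cores = 4))

In practice the overhead of splitting, copying to workers, and binding the results tends to eat most of the gain, which is part of why a single data.table join is hard to beat.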

Anyway, this matching step is still very fast. We run a benchmark comparing joins; one of the questions (question 5) is a "big to big join", which seems to correspond to your scenario: https://h2oai.github.io/db-benchmark/. Below are the results for the join task at the 100M data size; Q5 is a join of a 100M-row LHS to a 100M-row RHS:

[db-benchmark chart: join task timings at 100M rows]

You can see that data.table is pretty much at the top. Note that the benchmark joins on a single integer column, so there is likely to be some difference from your scenario, where you join on two columns. The benchmark also does not take into account the possibility of pre-sorting the data. Try using setkey (instead of setorder) on your tables to sort them by the join columns, as in the sketch below. To be fair, I believe this kind of setup might not be easy to beat.
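A minimal sketch of that suggestion, using the table and column names from the question (assumed to exist as shown there):

library(data.table)

# key both tables on their join columns; setkey sorts them in place and
# marks them as keyed
setkey(fcc_temp, BlockCode, year)
setkey(block_data_long, block_fips, year)

# the update join itself is unchanged; i.pop / i.tract make explicit that
# the values come from block_data_long
fcc_temp[block_data_long,
         c("pop", "tract") := list(i.pop, i.tract),
         on = c(BlockCode = "block_fips", year = "year")]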

In a future version, computing the matches for a join will be parallelized as well; a draft of that is already in the repository.
