
Join two big tables with Apache Spark

I want to join two very big tables on a specific mutual key using Spark, and I am trying to understand the optimal way to do that.

Let's say, for example:

  • table 1 contains 900M rows and ~100 columns
  • table 2 contains 600M rows and ~200 columns.
  • We can't use a "broadcast join": the tables are too big to be broadcast.

I want to join (inner join) the tables on the mutual 'id' column that exists in both of them. In addition, I know that the 'id' column contains the same set of values in both tables: there is no id value that exists in one table but not in the other.

The ideal way I can think of is to "divide" each of my tables into partitions/buckets that contain the same 'id' values and to send them to the same executor, which will compute the join result with minimal data shuffling across the cluster.
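As a rough sketch of that idea in PySpark (the DataFrame names, the source table names and the partition count below are placeholders for illustration, not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("co-partitioned-join").getOrCreate()

# df1 and df2 stand in for the two big tables; 'id' is the mutual join key.
df1 = spark.table("table1")
df2 = spark.table("table2")

num_partitions = 200  # placeholder value; tune to the cluster and data size

# Hash-repartition both sides by the join key so that rows with the same
# 'id' land in the partition with the same index on both sides.
df1_part = df1.repartition(num_partitions, "id")
df2_part = df2.repartition(num_partitions, "id")

joined = df1_part.join(df2_part, on="id", how="inner")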

My questions are:

  1. If I use, for example, .repartition(5, 'id') on each of the tables - will each of the 5 partitions contain the same 'id' values in both tables? (as long as both tables have the same 'id' values)

For example:

df1
+---+---+------+
|age| id|  name|
+---+---+------+
|  5|  1| David|
| 50|  2|  Lily|
| 10|  3|   Dan|
| 15|  4|Nicole|
| 16|  5|  Dana|
| 19|  6|   Ron|
| 20|  7| Alice|
| 22|  8|  Nora|
| 45|  9|  Sara|
| 70| 10| Aaron|
+---+---+------+


df2
+---+-----+
| id|price|
+---+-----+
|  1| 30.8|
|  1| 40.3|
|  2|100.0|
|  2| 30.1|
|  3| 99.0|
|  3|102.0|
|  4| 81.2|
|  4| 91.2|
|  5| 73.4|
|  6| 22.2|
|  7|374.4|
|  8|669.7|
|  9|  4.8|
| 10|35.38|
+---+-----+

df1.repartition(5,'id')
df2.repartition(5,'id')

If df1 partitions are: [id=1,id=2],[id=3,id=4],[id=5,id=6],[id=7,id=8],[id=9,id=10]

Will it necessarily be the same for df2?
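One way to check this empirically (a small sketch added here for illustration; spark_partition_id comes from pyspark.sql.functions) is to tag each row with the index of the partition it landed in after repartitioning:

from pyspark.sql import functions as F

# Repartition both DataFrames by the join key with the same partition count.
df1_p = df1.repartition(5, "id")
df2_p = df2.repartition(5, "id")

# Because repartition(n, col) uses the same hash partitioning on both sides,
# a given 'id' value should map to the same partition index in df1_p and df2_p.
df1_p.select("id", F.spark_partition_id().alias("pid")).orderBy("id").show()
df2_p.select("id", F.spark_partition_id().alias("pid")).orderBy("id").show()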

  2. If I use 'bucketBy' in the same way, will I get the same 'id' values in the buckets of both tables?

  3. Will Spark send the right partitions to the same executor? I mean, will the partition that contains [id=1,id=2] of table 1 and the partition that contains [id=1,id=2] of table 2 be sent to the same executor for the join?

If I am missing something, or if you can recommend another way to join two big tables under the assumptions I mentioned, it would be very helpful.

Take a look at this answer.
TL;DR: If you only want to join them once and that is the only aim of the re-partitioning, just join them directly.
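To see what that looks like in practice, here is a sketch of inspecting the plan of a plain join (the exact plan text varies with the Spark version and with adaptive query execution):

# A plain inner join on the mutual key - no manual repartitioning.
joined = df1.join(df2, on="id", how="inner")

# For two large, non-broadcastable tables the physical plan typically shows an
# Exchange (hashpartitioning on 'id') on each side feeding a SortMergeJoin,
# i.e. Spark already shuffles both tables by the join key and co-locates the
# matching partitions for you.
joined.explain()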

Yes, it would have to be like that; otherwise the whole paradigm of JOINing would not be reliable.

You actually mean the Worker - the machine hosting the Executor(s).

repartition on its own (without a column) would not be advisable, as it uses round-robin partitioning.
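The distinction, as a small sketch:

# repartition(n) with no column uses round-robin partitioning: rows are spread
# evenly, but rows with the same 'id' are NOT kept together.
df_round_robin = df1.repartition(5)

# repartition(n, col) hash-partitions on the column, so all rows with the same
# 'id' land in the same partition - which is what helps a join.
df_hashed = df1.repartition(5, "id")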

Range partitioning works as well. I checked to be sure, but the proviso is that both tables have the same distribution of partitioning values.
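For completeness, a sketch of the range-partitioning variant (repartitionByRange samples each DataFrame to pick its range boundaries, which is why the "same distribution" proviso matters):

# Range-partition both sides on the join key. The partition boundaries are
# derived from sampling each DataFrame separately, so they only line up if the
# 'id' values are distributed the same way in both tables.
df1_ranged = df1.repartitionByRange(5, "id")
df2_ranged = df2.repartitionByRange(5, "id")

joined_ranged = df1_ranged.join(df2_ranged, on="id", how="inner")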

It all works on the premise of lazy evaluation.

bucketBy can be used - but it is more for persisting to disk and reusing in a subsequent application.
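A sketch of that usage (the bucket count and output table names are placeholders; note that bucketBy only works together with saveAsTable):

# Write both tables bucketed and sorted by the join key. The bucketing metadata
# is stored in the metastore, so a later application can read the tables back
# and join them on 'id', typically avoiding another shuffle on that key.
df1.write.bucketBy(50, "id").sortBy("id").mode("overwrite").saveAsTable("table1_bucketed")
df2.write.bucketBy(50, "id").sortBy("id").mode("overwrite").saveAsTable("table2_bucketed")

# In the next application:
t1 = spark.table("table1_bucketed")
t2 = spark.table("table2_bucketed")
joined = t1.join(t2, on="id", how="inner")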

Again, you need not worry about assisting Spark: lazy evaluation means the Optimizer has the chance to work it all out - including which Worker to allocate to. But that is at a lower level of detail and abstraction.
