

When creating two different Spark pair RDDs with the same key set, will Spark distribute partitions with the same key to the same machine?

I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce data shuffling, I wish I could add a pre-distribution phase so that partitions with the same key are placed on the same machine. Hopefully this could reduce some shuffle time.

I want to know: is Spark smart enough to do that for me, or do I have to implement this logic myself?

I know that when I join two RDDs and one has been preprocessed with partitionBy, Spark is smart enough to use that information and shuffle only the other RDD. But I don't know what will happen if I use partitionBy on both RDDs and then do the join.

If you use the same partitioner for both RDDs you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same node.
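The co-partitioning property can be sketched in plain Python, with no Spark required (partition_by here is an invented helper modelled on a hash partitioner, not a Spark API): a partitioner is just a deterministic key-to-partition-index function, so applying the same partitioner to two datasets sends equal keys to equal partition indices, even though the partitions themselves may live on different nodes.

```python
# Plain-Python sketch of what a hash partitioner does. partition_by is an
# illustrative helper, not Spark's actual implementation.
def partition_by(pairs, num_partitions):
    """Assign each (key, value) pair to partition hash(key) % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

a = [("u1", 10), ("u2", 20), ("u3", 30)]    # first pair dataset
b = [("u3", "x"), ("u1", "y"), ("u2", "z")]  # second pair dataset

parts_a = partition_by(a, 4)
parts_b = partition_by(b, 4)

# Because both datasets go through the same partitioner, each partition
# index holds the same key set in both - the co-partitioning property.
```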

Nevertheless, performance should be better than if the two RDDs had different partitioners.

I have found Speeding Up Joins by Assigning a Known Partitioner helpful for understanding the effect of using the same partitioner for both RDDs:

Speeding Up Joins by Assigning a Known Partitioner

If you have to do an operation before the join that requires a shuffle, such as aggregateByKey or reduceByKey, you can prevent the shuffle by adding a hash partitioner with the same number of partitions as an explicit argument to the first operation and persisting the RDD before the join.
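To see why a join over co-partitioned data needs no further shuffle, here is a plain-Python sketch (again no Spark required; partition_by and local_join are illustrative helpers, not Spark APIs): once both sides are hashed into the same number of partitions, the join can be computed partition-by-partition, with no data crossing partition boundaries.

```python
from itertools import product

# Illustrative helpers, not Spark APIs: hash both sides into the same
# number of partitions, then join each pair of same-indexed partitions.
def partition_by(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def local_join(part_a, part_b):
    # Join within one pair of same-indexed partitions: no data movement.
    return [(k1, (v1, v2))
            for (k1, v1), (k2, v2) in product(part_a, part_b)
            if k1 == k2]

a = [("u1", 1), ("u2", 2), ("u3", 3)]
b = [("u1", "x"), ("u2", "y"), ("u4", "z")]

joined = [kv
          for pa, pb in zip(partition_by(a, 4), partition_by(b, 4))
          for kv in local_join(pa, pb)]
# joined contains exactly the keys present in both datasets: u1 and u2.
```

In Spark itself, this corresponds to giving reduceByKey (or partitionBy) the same number of partitions on both RDDs and persisting the results before the join, as the quoted passage describes.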

