
In Apache Spark cogroup, how to make sure 1 RDD of >2 operands is not moved?

In a cogroup transformation, e.g. RDD1.cogroup(RDD2, ...), I used to assume that Spark only shuffles/moves RDD2 and retains RDD1's partitioning and in-memory storage if:

  1. RDD1 has an explicit partitioner, and
  2. RDD1 is cached.

In my other projects most of the shuffling behaviour seems to be consistent with this assumption. So yesterday I wrote a short Scala program to prove it once and for all:

// sc is the SparkContext
import org.apache.spark.HashPartitioner

// rdd1 has an explicit partitioner and is cached (count() materialises it)
val rdd1 = sc.parallelize(1 to 10, 4).map(v => v -> v)
  .partitionBy(new HashPartitioner(4))
rdd1.persist().count()

// rdd2 shares rdd1's key set but has no partitioner
val rdd2 = sc.parallelize(1 to 10, 4).map(v => (11 - v) -> v)

val cogrouped = rdd1.cogroup(rdd2).map {
  v =>
    v._2._1.head -> v._2._2.head
}

// zip the three RDDs partition by partition so rows can be compared side by side
val zipped = cogrouped.zipPartitions(rdd1, rdd2) {
  (itr1, itr2, itr3) =>
    itr1.zipAll(itr2.map(_._2), 0 -> 0, 0).zipAll(itr3.map(_._2), (0 -> 0) -> 0, 0)
      .map {
        v =>
          (v._1._1._1, v._1._1._2, v._1._2, v._2)
      }
}

zipped.collect().foreach(println)

If rdd1 doesn't move, the first column of zipped should have the same value as the third column. So I ran the program, and oops:

(4,7,4,1)
(8,3,8,2)
(1,10,1,3)
(9,2,5,4)
(5,6,9,5)
(6,5,2,6)
(10,1,6,7)
(2,9,10,0)
(3,8,3,8)
(7,4,7,9)
(0,0,0,10)

The assumption is not true. Spark probably did some internal optimisation and decided that regenerating rdd1's partitions is much faster than keeping them in cache.

So the question is: if my programmatic requirement not to move RDD1 (and to keep it cached) exists for reasons other than speed (e.g. resource locality), or if on some occasions Spark's internal optimisation is not preferable, is there a way to explicitly instruct the framework not to move an operand in all cogroup-like operations? This also includes join, outer join, and groupWith.

Thanks a lot for your help. So far I'm using a broadcast join as a not-so-scalable makeshift solution; it is not going to last long before crashing my cluster. I'm expecting a solution consistent with distributed computing principles.
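For reference, the broadcast-join workaround mentioned above looks roughly like this (a minimal sketch, assuming rdd2 is small enough to be collected to the driver and has unique keys; rdd1 is never shuffled, but memory becomes the bottleneck):

// Collect the small side and broadcast it; rdd1's partitions stay in place.
// collectAsMap() keeps only one value per key, fine here since keys are unique.
val rdd2Map = sc.broadcast(rdd2.collectAsMap())
val joined = rdd1.mapPartitions { iter =>
  iter.flatMap { case (k, v) =>
    rdd2Map.value.get(k).map(w => k -> (v, w))
  }
}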

If rdd1 doesn't move, the first column of zipped should have the same value as the third column

This assumption is just incorrect. Creating a CoGroupedRDD is not only about the shuffle, but also about generating the internal structures required for matching corresponding records. Internally, Spark uses its own ExternalAppendOnlyMap, which is backed by a custom open hash table implementation (AppendOnlyMap) that doesn't provide any ordering guarantees.
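You can observe both effects with the rdd1 and cogrouped values from the program above (a quick sketch): each partition of cogrouped holds the same key set as the corresponding partition of rdd1, but in whatever order the hash map iteration happens to produce.

// Dump each partition's contents: the key sets per partition should match
// between rdd1 and cogrouped, while the order within a partition may differ.
def dump(rdd: org.apache.spark.rdd.RDD[(Int, Int)]): Unit =
  rdd.mapPartitionsWithIndex { (i, it) => Iterator(i -> it.toList) }
    .collect().foreach(println)

dump(rdd1)      // e.g. (0,List((4,4), (8,8))) with HashPartitioner(4)
dump(cogrouped) // same keys per partition, possibly in a different order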

If you check the debug string:

zipped.toDebugString
(4) ZippedPartitionsRDD3[8] at zipPartitions at <console>:36 []
 |  MapPartitionsRDD[7] at map at <console>:31 []
 |  MapPartitionsRDD[6] at cogroup at <console>:31 []
 |  CoGroupedRDD[5] at cogroup at <console>:31 []
 |  ShuffledRDD[2] at partitionBy at <console>:27 []
 |      CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MapPartitionsRDD[1] at map at <console>:26 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:26 []
 +-(4) MapPartitionsRDD[4] at map at <console>:29 []
    |  ParallelCollectionRDD[3] at parallelize at <console>:29 []
 |  ShuffledRDD[2] at partitionBy at <console>:27 []
 |      CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MapPartitionsRDD[1]...

you'll see that Spark indeed uses CachedPartitions to compute the zipped RDD. If you also skip the map transformation, which removes the partitioner, you'll see that cogroup reuses the partitioner provided by rdd1:

rdd1.cogroup(rdd2).partitioner == rdd1.partitioner
Boolean = true
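Going one step further, you can inspect the dependencies of the underlying CoGroupedRDD directly (a sketch relying on the @DeveloperApi CoGroupedRDD constructor, so treat it as an assumption about Spark internals rather than stable public API): an operand whose partitioner equals the cogroup's partitioner should get a narrow OneToOneDependency and be read in place, while the other operand gets a ShuffleDependency.

import org.apache.spark.rdd.CoGroupedRDD

// Build the CoGroupedRDD explicitly with rdd1's partitioner and print its
// dependencies; only rdd2's side is expected to involve a shuffle.
val cg = new CoGroupedRDD[Int](Seq(rdd1, rdd2), rdd1.partitioner.get)
cg.dependencies.foreach(println)
// expected: org.apache.spark.OneToOneDependency@...  <- rdd1, read in place
//           org.apache.spark.ShuffleDependency@...   <- rdd2, shuffled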
