
In Apache Spark cogroup, how to make sure one RDD of the 2+ operands is not moved?

In a cogroup transformation, e.g. RDD1.cogroup(RDD2, ...), I used to assume that Spark only shuffles/moves RDD2 and retains RDD1's partitioning and in-memory storage if:

  1. RDD1 has an explicit partitioner
  2. RDD1 is cached.

In my other projects, most of the shuffling behaviour seems to be consistent with this assumption, so yesterday I wrote a short Scala program to prove it once and for all:

// sc is the SparkContext
import org.apache.spark.HashPartitioner

// rdd1: explicitly hash-partitioned into 4 partitions, then cached
val rdd1 = sc.parallelize(1 to 10, 4).map(v => v->v)
  .partitionBy(new HashPartitioner(4))
rdd1.persist().count()
// rdd2: same keys in reverse order, no explicit partitioner, not cached
val rdd2 = sc.parallelize(1 to 10, 4).map(v => (11-v)->v)

val cogrouped = rdd1.cogroup(rdd2).map {
  v =>
    v._2._1.head -> v._2._2.head
}

// Zip the cogrouped result with rdd1 and rdd2 partition by partition:
// if rdd1 is left untouched, column 1 (cogrouped key) should equal column 3 (rdd1's value).
val zipped = cogrouped.zipPartitions(rdd1, rdd2) {
  (itr1, itr2, itr3) =>
    itr1.zipAll(itr2.map(_._2), 0->0, 0).zipAll(itr3.map(_._2), (0->0)->0, 0)
      .map {
        v =>
          (v._1._1._1, v._1._1._2, v._1._2, v._2)
      }
}

zipped.collect().foreach(println)

If rdd1 doesn't move, the first column of zipped should have the same value as the third column. So I ran the program, and oops:

(4,7,4,1)
(8,3,8,2)
(1,10,1,3)
(9,2,5,4)
(5,6,9,5)
(6,5,2,6)
(10,1,6,7)
(2,9,10,0)
(3,8,3,8)
(7,4,7,9)
(0,0,0,10)

The assumption is not true. Spark probably did some internal optimisation and decided that regenerating rdd1's partitions is much faster than keeping them in cache.

So the question is: if my programmatic requirement to not move RDD1 (and keep it cached) exists for reasons other than speed (e.g. resource locality), or if on some occasions Spark's internal optimisation is not preferable, is there a way to explicitly instruct the framework not to move an operand in all cogroup-like operations? This also includes join, outer join, and groupWith.

Thanks a lot for your help. So far I'm using a broadcast join as a not-so-scalable makeshift solution; it is not going to last long before crashing my cluster. I'm expecting a solution consistent with distributed computing principles.
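
For context, the broadcast-join makeshift amounts to something like the sketch below (illustrative names, not the actual project code): rdd1 is never moved, but all of rdd2 has to be collected and fit in memory on the driver and on every executor, which is exactly why it won't scale.

// Sketch of a broadcast (map-side) join; rdd2Lookup and broadcastJoined are
// illustrative names. rdd1 keeps its partitioning and cache, while the whole
// of rdd2 is collected on the driver and shipped to every executor.
val rdd2Lookup = sc.broadcast(rdd2.collectAsMap())

val broadcastJoined = rdd1.mapPartitions({ it =>
  val lookup = rdd2Lookup.value
  it.flatMap { case (k, v) => lookup.get(k).map(w => k -> (v, w)) }
}, preservesPartitioning = true)

broadcastJoined.collect().foreach(println)
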

If rdd1 doesn't move, the first column of zipped should have the same value as the third column

This assumption is just incorrect. Creating a CoGroupedRDD is not only about the shuffle, but also about generating the internal structures required for matching corresponding records. Internally, Spark uses its own ExternalAppendOnlyMap, which is built on a custom open hash table implementation (AppendOnlyMap) that doesn't provide any ordering guarantees.
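
You can see this directly by comparing what each partition contains while ignoring iteration order. A minimal sketch against the rdd1/rdd2 from the question (the value names are illustrative): every partition of the cogrouped RDD still holds the keys of the corresponding rdd1 partition, even though the records come out in a different order.

// Compare per-partition key *sets* instead of iteration order.
// rdd1KeysByPartition / cogroupKeysByPartition are illustrative names.
val rdd1KeysByPartition = rdd1
  .mapPartitionsWithIndex((idx, it) => Iterator(idx -> it.map(_._1).toSet))
  .collect().toMap

val cogroupKeysByPartition = rdd1.cogroup(rdd2)
  .mapPartitionsWithIndex((idx, it) => Iterator(idx -> it.map(_._1).toSet))
  .collect().toMap

// Each cogrouped partition contains (at least) the keys of the matching rdd1
// partition: rdd1's records stayed where they were, they were only re-ordered
// inside each partition by the internal hash map.
rdd1KeysByPartition.foreach { case (idx, keys) =>
  println(s"partition $idx co-located: ${keys.subsetOf(cogroupKeysByPartition(idx))}")
}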

If you check the debug string:

zipped.toDebugString
(4) ZippedPartitionsRDD3[8] at zipPartitions at <console>:36 []
 |  MapPartitionsRDD[7] at map at <console>:31 []
 |  MapPartitionsRDD[6] at cogroup at <console>:31 []
 |  CoGroupedRDD[5] at cogroup at <console>:31 []
 |  ShuffledRDD[2] at partitionBy at <console>:27 []
 |      CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MapPartitionsRDD[1] at map at <console>:26 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:26 []
 +-(4) MapPartitionsRDD[4] at map at <console>:29 []
    |  ParallelCollectionRDD[3] at parallelize at <console>:29 []
 |  ShuffledRDD[2] at partitionBy at <console>:27 []
 |      CachedPartitions: 4; MemorySize: 512.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MapPartitionsRDD[1]...

you'll see that Spark indeed uses the CachedPartitions to compute the zipped RDD. If you also skip the map transformations, which remove the partitioner, you'll see that cogroup reuses the partitioner provided by rdd1:

rdd1.cogroup(rdd2).partitioner == rdd1.partitioner
Boolean = true
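
Because the partitioners match, the underlying CoGroupedRDD wires rdd1 in through a narrow dependency and only shuffles rdd2. A minimal sketch to confirm this (it assumes, as in current Spark versions, that the RDD returned by cogroup is a thin mapValues wrapper sitting directly on top of the CoGroupedRDD):

import org.apache.spark.{OneToOneDependency, ShuffleDependency}

// Walk one level down from the RDD returned by cogroup to the CoGroupedRDD.
val coGroupedRdd = rdd1.cogroup(rdd2).dependencies.head.rdd

coGroupedRdd.dependencies.foreach {
  case d: OneToOneDependency[_]      => println(s"narrow (no shuffle): ${d.rdd}")
  case d: ShuffleDependency[_, _, _] => println(s"shuffled:            ${d.rdd}")
  case d                             => println(s"other dependency:    $d")
}

With rdd1's partitioner in place, this should print one narrow dependency (rdd1's lineage) and one shuffle dependency (rdd2's lineage): rdd1 is not shuffled, its records are just re-grouped in place.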
