Avoid Shuffling with ReduceByKey in Spark
I am taking the Coursera course on Scala and Spark, and I am trying to optimize this snippet:
val indexedMeansG = vectors.
map(v => findClosest(v, means) -> v).
groupByKey.mapValues(averageVectors)
vectors is an RDD[(Int, Int)]. To see the list of dependencies and the lineage of the RDD, I've used:
println(s"""GroupBy:
| Deps: ${indexedMeansG.dependencies.size}
| Deps: ${indexedMeansG.dependencies}
| Lineage: ${indexedMeansG.toDebugString}""".stripMargin)
Which shows this:
/* GroupBy:
* Deps: 1
* Deps: List(org.apache.spark.OneToOneDependency@44d1924)
* Lineage: (6) MapPartitionsRDD[18] at mapValues at StackOverflow.scala:207 []
* ShuffledRDD[17] at groupByKey at StackOverflow.scala:207 []
* +-(6) MapPartitionsRDD[16] at map at StackOverflow.scala:206 []
* MapPartitionsRDD[13] at map at StackOverflow.scala:139 []
* CachedPartitions: 6; MemorySize: 84.0 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
* MapPartitionsRDD[12] at values at StackOverflow.scala:116 []
* MapPartitionsRDD[11] at mapValues at StackOverflow.scala:115 []
* MapPartitionsRDD[10] at groupByKey at StackOverflow.scala:92 []
* MapPartitionsRDD[9] at join at StackOverflow.scala:91 []
* MapPartitionsRDD[8] at join at StackOverflow.scala:91 []
* CoGroupedRDD[7] at join at StackOverflow.scala:91 []
* +-(6) MapPartitionsRDD[4] at map at StackOverflow.scala:88 []
* | MapPartitionsRDD[3] at filter at StackOverflow.scala:88 []
* | MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* | src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* | src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 []
* +-(6) MapPartitionsRDD[6] at map at StackOverflow.scala:89 []
* MapPartitionsRDD[5] at filter at StackOverflow.scala:89 []
* MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 [] */
From this List(org.apache.spark.OneToOneDependency@44d1924) I deduced that no shuffling is being done; am I right? However, ShuffledRDD[17] is printed below, which means there is in fact shuffling.
I've tried to replace the groupByKey call with a reduceByKey, like this:
val indexedMeansR = vectors.
map(v => findClosest(v, means) -> v).
reduceByKey((a, b) => (a._1 + b._1) / 2 -> (a._2 + b._2) / 2)
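As an aside, this reduce function does not compute the same thing as averageVectors: pairwise averaging is not associative, so the result depends on the order in which partial results are combined. A minimal sketch of the usual fix (illustrative only; it reuses findClosest and means from above, and indexedMeansAvg is a made-up name) reduces over componentwise sums plus a count, and divides once at the end:

val indexedMeansAvg = vectors.
  map(v => findClosest(v, means) -> (v._1.toLong, v._2.toLong, 1L)).
  // Componentwise sums plus a count: associative and commutative,
  // so reduceByKey may combine partial results in any order.
  reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3)).
  // Divide once at the end to obtain the mean vector per key.
  mapValues(t => ((t._1 / t._3).toInt, (t._2 / t._3).toInt))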
The dependencies and lineage of indexedMeansR are:
/* ReduceBy:
* Deps: 1
* Deps: List(org.apache.spark.ShuffleDependency@4d5e813f)
* Lineage: (6) ShuffledRDD[17] at reduceByKey at StackOverflow.scala:211 []
* +-(6) MapPartitionsRDD[16] at map at StackOverflow.scala:210 []
* MapPartitionsRDD[13] at map at StackOverflow.scala:139 []
* CachedPartitions: 6; MemorySize: 84.0 MB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
* MapPartitionsRDD[12] at values at StackOverflow.scala:116 []
* MapPartitionsRDD[11] at mapValues at StackOverflow.scala:115 []
* MapPartitionsRDD[10] at groupByKey at StackOverflow.scala:92 []
* MapPartitionsRDD[9] at join at StackOverflow.scala:91 []
* MapPartitionsRDD[8] at join at StackOverflow.scala:91 []
* CoGroupedRDD[7] at join at StackOverflow.scala:91 []
* +-(6) MapPartitionsRDD[4] at map at StackOverflow.scala:88 []
* | MapPartitionsRDD[3] at filter at StackOverflow.scala:88 []
* | MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* | src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* | src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 []
* +-(6) MapPartitionsRDD[6] at map at StackOverflow.scala:89 []
* MapPartitionsRDD[5] at filter at StackOverflow.scala:89 []
* MapPartitionsRDD[2] at map at StackOverflow.scala:69 []
* src/main/resources/stackoverflow/stackoverflow.csv MapPartitionsRDD[1] at textFile at StackOverflow.scala:23 []
* src/main/resources/stackoverflow/stackoverflow.csv HadoopRDD[0] at textFile at StackOverflow.scala:23 [] */
This time the dependency is a ShuffleDependency, and I am not able to understand why.
Since the RDD is a pair RDD whose keys are Ints, and therefore have an ordering, I've also tried to change the partitioner and use a RangePartitioner, but that doesn't improve things either.
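For reference, the shuffle can only disappear if the data is already partitioned by the key being reduced on, and here the key is freshly computed by findClosest, so Spark cannot know where each record belongs without moving it. One workaround, sketched below under the assumptions of the question (6 partitions, the same vectors and findClosest; keyed and indexedMeansP are made-up names), is to pay the shuffle once with partitionBy, cache the result, and let later key-based operations reuse that layout:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(6)

// Shuffle once to co-locate each key, then keep that layout cached.
val keyed = vectors.
  map(v => findClosest(v, means) -> v).
  partitionBy(part).
  cache()

// Because `keyed` already carries `part` as its partitioner, this
// reduceByKey can reuse it, and its lineage should show a
// OneToOneDependency instead of a ShuffleDependency.
val indexedMeansP = keyed.reduceByKey(part, (a, b) => (a._1 + b._1, a._2 + b._2))

This only pays off if keyed is reused by several key-based operations; for a single reduceByKey, the partitionBy is itself a shuffle, so it merely moves the cost.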
A reduceByKey operation still involves a shuffle, since it is still required to ensure that all items with the same key end up in the same partition. However, this will be a much smaller shuffle than the one performed by groupByKey: reduceByKey applies the reduction within each partition before shuffling, which cuts down the amount of data that has to be moved.
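To make the difference concrete, here is a minimal self-contained sketch (hypothetical data; sc is assumed to be a SparkContext):

val pairs = sc.parallelize(Seq(1 -> 10, 1 -> 20, 2 -> 30), numSlices = 6)

// groupByKey: every (key, value) record crosses the network, and the
// full Iterable of values per key is materialized on the reducer side.
val grouped = pairs.groupByKey().mapValues(vs => vs.sum / vs.size)

// reduceByKey: the merge function first runs inside each input
// partition (map-side combine), so at most one partial value per key
// per partition is shuffled.
val reduced = pairs.reduceByKey(_ + _)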