Spark aggregateByKey partitioner order
If I pass a hash partitioner to Spark's aggregateByKey function, i.e.
myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp)
is myRDD repartitioned first, before its key/value pairs are aggregated using combOp and mergeOp? Or does myRDD go through combOp and mergeOp first, with the resulting RDD then repartitioned using the HashPartitioner?
aggregateByKey applies map-side aggregation before the eventual shuffle. Since every partition is processed sequentially, the only operations applied in this phase are initialization (creating zeroValue) and combOp. The goal of mergeOp is to combine aggregation buffers, so it is not used before the shuffle.
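The ordering described above can be sketched without Spark at all. The following is a minimal plain-Scala simulation, not Spark's actual implementation: the object name, the helper functions, and the sample data are all hypothetical, and combOp / mergeOp follow the question's naming (Spark's own signature calls these two arguments seqOp and combOp). It only illustrates that each input partition is folded with combOp starting from zeroValue, the per-key buffers are then routed by a hash of the key, and mergeOp is applied only after that routing.

```scala
object AggregateOrderSketch {
  // Hypothetical stand-ins for the two functions from the question.
  val combOp: (Int, Int) => Int = (buffer, value) => buffer + value
  val mergeOp: (Int, Int) => Int = (left, right) => left + right

  // Phase 1 (before the shuffle): fold one input partition on its own,
  // starting each key from zeroValue and applying only combOp.
  def mapSideCombine(partition: Seq[(String, Int)], zeroValue: Int): Map[String, Int] =
    partition.foldLeft(Map.empty[String, Int]) { case (buffers, (key, value)) =>
      buffers.updated(key, combOp(buffers.getOrElse(key, zeroValue), value))
    }

  // Phase 2 (the shuffle): pick a target partition for a key,
  // imitating a hash partitioner with a non-negative modulo.
  def targetPartition(key: String, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Phase 3 (after the shuffle): merge the buffers of each key with mergeOp.
  def aggregate(
      input: Seq[Seq[(String, Int)]],
      zeroValue: Int,
      numPartitions: Int): Map[Int, Map[String, Int]] = {
    val buffers = input.flatMap(p => mapSideCombine(p, zeroValue).toSeq)
    buffers
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (pid, kvs) =>
        pid -> kvs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(mergeOp) }
      }
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Seq("a" -> 1, "b" -> 2, "a" -> 3), Seq("a" -> 4, "b" -> 5))
    println(input.map(mapSideCombine(_, 0))) // per-partition buffers, combOp only
    println(aggregate(input, 0, 20))         // buffers merged per key after routing
  }
}
```

In this sketch the shuffle moves one buffer per key per input partition rather than every raw record, which is the point of aggregating on the map side first.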
If the input RDD is a ShuffledRDD with the same partitioner as the one requested for aggregateByKey, then the data is not shuffled at all and is aggregated locally using mapPartitions.
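One way to observe the no-shuffle case is to inspect the dependencies of the resulting RDD. The snippet below is a sketch under stated assumptions: it needs spark-core on the classpath, runs against a local master, and the object name, sample data, and partition counts are made up for illustration. It compares calling aggregateByKey on an RDD that has no partitioner with calling it on an RDD already partitioned by an equal HashPartitioner (two HashPartitioner instances compare equal when they have the same number of partitions).

```scala
import org.apache.spark.{HashPartitioner, OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

object AggregateByKeyShuffleCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("aggregateByKey-partitioner-check")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data: an RDD with no partitioner set.
      val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4), 3)

      // Case 1: no matching partitioner on the input, so a shuffle is expected.
      val direct = pairs.aggregateByKey(0, new HashPartitioner(4))(_ + _, _ + _)
      println(direct.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])

      // Case 2: the input already uses an equal partitioner, so the aggregation
      // is expected to be a narrow (one-to-one) step with no second shuffle.
      val prePartitioned = pairs.partitionBy(new HashPartitioner(4))
      val local = prePartitioned.aggregateByKey(0, new HashPartitioner(4))(_ + _, _ + _)
      println(local.dependencies.head.isInstanceOf[OneToOneDependency[_]])

      // Both paths should yield the same per-key sums.
      println(direct.collect().sorted.sameElements(local.collect().sorted))
    } finally {
      sc.stop()
    }
  }
}
```

The same information is also visible in the lineage printed by toDebugString, which can be easier to read when the job has more stages.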