
Spark reduceByKey performance/complexity when reducing lists with Scala

I need to perform a reduceByKey on lists. What would be the fastest solution? I'm using the ::: operator to merge two lists in the reduce operation, but ::: is O(n), so I'm afraid the whole reduce will end up being O(n²).

Code example:

val rdd: RDD[(Int, List[Int])] = getMyRDD()
rdd.reduceByKey(_ ::: _)
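To illustrate the concern outside Spark, here is a rough plain-Scala sketch (no cluster involved) of why a fold that keeps the accumulator as the left operand of ::: degrades, and how a mutable buffer avoids the repeated copying:

import scala.collection.mutable.ArrayBuffer

// ::: copies its left operand, so a fold whose accumulator keeps growing
// does roughly 1 + 2 + ... + k ≈ O(k^2) work for k single-element lists.
val lists: Seq[List[Int]] = Seq.fill(10000)(List(1))

// Quadratic in the worst case: the growing accumulator is copied on every step.
val slow: List[Int] = lists.reduce(_ ::: _)

// Linear alternative: append into a mutable buffer and convert once at the end.
val fast: List[Int] = lists.foldLeft(ArrayBuffer.empty[Int])(_ ++= _).toList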

What would be the best/most efficient solution?

The best you can do is:

rdd.groupByKey.mapValues(_.flatten.toList)

This will:

  • Skip the map-side reduce, which is pointless here because concatenating lists does not shrink the data before the shuffle. This makes the shuffle marginally larger but significantly reduces GC time.
  • Use a mutable buffer with amortized constant-time appends for the intermediate aggregations.
  • Flatten the intermediate aggregate in O(N) time (a runnable sketch follows this list).
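Putting that together, a minimal self-contained sketch of the groupByKey approach (local master and a small hypothetical input standing in for getMyRDD() are my assumptions here):

import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyFlatten {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flatten-lists").setMaster("local[*]"))

    // Hypothetical stand-in for getMyRDD(): an RDD of (key, List[Int]) pairs.
    val rdd = sc.parallelize(Seq(
      (1, List(1, 2)), (1, List(3)), (2, List(4, 5))
    ))

    // Group every List[Int] per key, then flatten once into a single List[Int].
    val merged = rdd.groupByKey.mapValues(_.flatten.toList)

    merged.collect().foreach(println) // e.g. (1,List(1, 2, 3)), (2,List(4, 5))
    sc.stop()
  }
}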

If you want map-side reduction, you can use aggregateByKey:

import scala.collection.mutable.ArrayBuffer

rdd.aggregateByKey(ArrayBuffer[Int]())(_ ++= _, _ ++= _).mapValues(_.toList)

but it will usually be significantly more expensive than the first solution.
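For completeness, the same aggregateByKey call with the two combiner functions spelled out (the helper name and signature are illustrative, assuming the question's RDD[(Int, List[Int])] shape), which makes the buffer-merging semantics explicit:

import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper wrapping the aggregateByKey call above.
def mergeWithAggregate(rdd: RDD[(Int, List[Int])]): RDD[(Int, List[Int])] =
  rdd.aggregateByKey(ArrayBuffer.empty[Int])(
    (buf, xs) => buf ++= xs, // seqOp: fold one List[Int] into the per-partition buffer
    (b1, b2) => b1 ++= b2    // combOp: merge the per-partition buffers after the shuffle
  ).mapValues(_.toList)      // convert back to an immutable List at the end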
