
Spark reduceByKey performance/complexity when reducing lists with Scala

I need to perform a reduceByKey on lists. What would be the fastest solution? I'm using the ::: operator to merge two lists in the reduce operation, but ::: is O(n), so I'm afraid the whole reduce will end up being O(n²).

Code example:

val rdd: RDD[(Int, List[Int])] = getMyRDD()
rdd.reduceByKey(_ ::: _)
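To illustrate the concern outside Spark, here is a rough plain-Scala sketch (no cluster involved) of why a fold that keeps the accumulator as the left operand of ::: degrades, and how a mutable buffer avoids the repeated copying:

import scala.collection.mutable.ArrayBuffer

// ::: copies its left operand, so a fold whose accumulator keeps growing
// does roughly 1 + 2 + ... + k ≈ O(k^2) work for k single-element lists.
val lists: Seq[List[Int]] = Seq.fill(10000)(List(1))

// Quadratic in the worst case: the growing accumulator is copied on every step.
val slow: List[Int] = lists.reduce(_ ::: _)

// Linear alternative: append into a mutable buffer and convert once at the end.
val fast: List[Int] = lists.foldLeft(ArrayBuffer.empty[Int])(_ ++= _).toList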

What would be the best/most efficient solution?

The best you can do is:

rdd.groupByKey.mapValues(_.flatten.toList)

This will:

  • Skip the map-side reduce, which is pointless here because concatenating lists does not shrink the data before the shuffle. This makes the shuffle marginally larger but significantly reduces GC time.
  • Use a mutable buffer with amortized constant-time appends for the intermediate aggregations.
  • Flatten the intermediate aggregate in O(N) time (a runnable sketch follows this list).
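Putting that together, a minimal self-contained sketch of the groupByKey approach (local master and a small hypothetical input standing in for getMyRDD() are my assumptions here):

import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyFlatten {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flatten-lists").setMaster("local[*]"))

    // Hypothetical stand-in for getMyRDD(): an RDD of (key, List[Int]) pairs.
    val rdd = sc.parallelize(Seq(
      (1, List(1, 2)), (1, List(3)), (2, List(4, 5))
    ))

    // Group every List[Int] per key, then flatten once into a single List[Int].
    val merged = rdd.groupByKey.mapValues(_.flatten.toList)

    merged.collect().foreach(println) // e.g. (1,List(1, 2, 3)), (2,List(4, 5))
    sc.stop()
  }
}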

If you want map-side reduction, you can use aggregateByKey:

import scala.collection.mutable.ArrayBuffer

rdd.aggregateByKey(ArrayBuffer[Int]())(_ ++= _, _ ++= _).mapValues(_.toList)

but it will usually be significantly more expensive than the first solution.
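For completeness, the same aggregateByKey call with the two combiner functions spelled out (the helper name and signature are illustrative, assuming the question's RDD[(Int, List[Int])] shape), which makes the buffer-merging semantics explicit:

import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper wrapping the aggregateByKey call above.
def mergeWithAggregate(rdd: RDD[(Int, List[Int])]): RDD[(Int, List[Int])] =
  rdd.aggregateByKey(ArrayBuffer.empty[Int])(
    (buf, xs) => buf ++= xs, // seqOp: fold one List[Int] into the per-partition buffer
    (b1, b2) => b1 ++= b2    // combOp: merge the per-partition buffers after the shuffle
  ).mapValues(_.toList)      // convert back to an immutable List at the end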
