
How to get the sum of two elements from a Spark RDD Iterable

I am trying to write a simple program to find the sum of hours logged and the sum of miles logged by each driver. I have applied groupByKey, and the RDD now looks like this:

(13,CompactBuffer((49,2643), (56,2553), (60,2539), (55,2553), (45,2762), (53,2699), (46,2519), (60,2719), (56,2760), (51,2731), (57,2671), (47,2604), (58,2510), (51,2649), (56,2559), (59,2604), (47,2613), (49,2585), (58,2749), (50,2756), (57,2596), (54,2517), (48,2554), (47,2576), (58,2528), (60,2765), (54,2689), (51,2739), (51,2698), (47,2739), (51,2546), (54,2647), (60,2504), (48,2536), (51,2602), (47,2651), (53,2545), (48,2665), (55,2670), (60,2524), (48,2612), (60,2712), (60,2583), (47,2773), (57,2589), (51,2512), (57,2607), (57,2576), (53,2604), (59,2702), (51,2687), (10,100)))

Could you suggest some useful Scala functions to get the sum of the two elements? Thanks!

If I understand your question correctly, here's one approach using groupByKey, mapValues, and reduce to aggregate hours and miles:

val rdd = sc.parallelize(Seq(
  (13, (49,2643)),
  (13, (56,2553)),
  (13, (60,2539)),
  (14, (40,1500)),
  (14, (50,2500))
))

rdd.groupByKey.mapValues(_.reduce((a, x) => (a._1 + x._1, a._2 + x._2))).collect
// res1: Array[(Int, (Int, Int))] = Array((13,(165,7735)), (14,(90,4000)))

Or, as pointed out by commenters, aggregate using reduceByKey directly if you don't need the intermediate result from groupByKey:

rdd.reduceByKey((a, x) => (a._1 + x._1, a._2 + x._2))
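
The tuple-wise sum itself is plain Scala, so it can be sketched without a Spark cluster using ordinary collections, where groupBy plays the role of groupByKey (the driver IDs and values below are the sample data from the answer, not real logs):

```scala
// Sample (driverId, (hours, miles)) records, as in the Spark example above.
val logs = Seq(
  (13, (49, 2643)),
  (13, (56, 2553)),
  (13, (60, 2539)),
  (14, (40, 1500)),
  (14, (50, 2500))
)

// Group by driver, then reduce each group's tuples element-wise:
// the reduce function is identical to the one passed to reduceByKey.
val totals: Map[Int, (Int, Int)] = logs
  .groupBy(_._1)
  .map { case (driver, trips) =>
    driver -> trips.map(_._2).reduce((a, x) => (a._1 + x._1, a._2 + x._2))
  }

// totals: Map(13 -> (165,7735), 14 -> (90,4000))
```

The same `(a, x) => (a._1 + x._1, a._2 + x._2)` function works in both places because tuple addition is associative, which is exactly what reduceByKey requires of its combiner.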
