How to get the sum of two elements from the Spark RDD Iterable

Question

I am trying to write a simple program to find the sum of hours logged and the sum of miles logged by a driver. I have applied groupByKey, and the RDD now looks like this:

(13,CompactBuffer((49,2643), (56,2553), (60,2539), (55,2553), (45,2762), (53,2699), (46,2519), (60,2719), (56,2760), (51,2731), (57,2671), (47,2604), (58,2510), (51,2649), (56,2559), (59,2604), (47,2613), (49,2585), (58,2749), (50,2756), (57,2596), (54,2517), (48,2554), (47,2576), (58,2528), (60,2765), (54,2689), (51,2739), (51,2698), (47,2739), (51,2546), (54,2647), (60,2504), (48,2536), (51,2602), (47,2651), (53,2545), (48,2665), (55,2670), (60,2524), (48,2612), (60,2712), (60,2583), (47,2773), (57,2589), (51,2512), (57,2607), (57,2576), (53,2604), (59,2702), (51,2687), (10,100)))

Could you suggest some useful Scala functions to get the sum of each of the two elements? Thanks!

Answer 1

If I understand your question correctly, here's one approach using groupByKey, mapValues and reduce to aggregate hours and miles:

val rdd = sc.parallelize(Seq(
  (13, (49,2643)),
  (13, (56,2553)),
  (13, (60,2539)),
  (14, (40,1500)),
  (14, (50,2500))
))

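// For each key, add up the pairs element-wise: (hours, miles) + (hours, miles).
// collect brings the result back to the driver as an Array.
rdd.groupByKey.mapValues(_.reduce((a, x) => (a._1 + x._1, a._2 + x._2))).collect
// res1: Array[(Int, (Int, Int))] = Array((13,(165,7735)), (14,(90,4000)))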
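
To see what the function passed to reduce does on its own, here is a minimal sketch in plain Scala (no Spark needed) applied to the three sample pairs for key 13. The names logs, reduced and folded are illustrative and not part of the original answer; the foldLeft variant is shown only as an alternative that also works on an empty collection.

```scala
// The values that groupByKey produces for one key are an Iterable of (hours, miles) pairs.
val logs: Iterable[(Int, Int)] = Seq((49, 2643), (56, 2553), (60, 2539))

// reduce combines the pairs two at a time, summing each position separately.
// It throws on an empty collection, which cannot happen for a group created by groupByKey.
val reduced = logs.reduce((a, x) => (a._1 + x._1, a._2 + x._2))

// foldLeft starts from a (0, 0) seed, so it would also handle an empty collection.
val folded = logs.foldLeft((0, 0)) { case ((h, m), (dh, dm)) => (h + dh, m + dm) }

println(reduced) // (165,7735)
```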

Or, as pointed out by commenters, aggregate with reduceByKey directly if you don't need the intermediate result from groupByKey:

rdd.reduceByKey( (a, x) => (a._1 + x._1, a._2 + x._2) )
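
reduceByKey is usually the better choice here because it combines values within each partition before the shuffle, whereas groupByKey ships every value across the network first. A self-contained sketch of the same aggregation follows, assuming Spark is on the classpath and a local master is acceptable; the app name and the names logs, totals and result are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// getOrCreate reuses an existing context (for example the `sc` in spark-shell)
// and otherwise starts a local one with this configuration.
val conf = new SparkConf().setAppName("driver-totals").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// (driverId, (hoursLogged, milesLogged))
val logs = sc.parallelize(Seq(
  (13, (49, 2643)),
  (13, (56, 2553)),
  (13, (60, 2539)),
  (14, (40, 1500)),
  (14, (50, 2500))
))

// Sum hours and miles per driver without materializing the grouped values.
val totals = logs.reduceByKey { case ((h1, m1), (h2, m2)) => (h1 + h2, m1 + m2) }

// collect returns the pairs in no guaranteed order, so sort by driver id for display.
val result = totals.collect().sortBy(_._1)
result.foreach(println)
```

Pattern matching on the two pairs is equivalent to the a._1 + x._1 form above; it only names the fields.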

solution1
1 ACCEPTED 2017-12-05 06:06:36
