I am trying to write a simple program to find the total hours logged and total miles logged by each driver. I have applied groupByKey, and the RDD now looks like this:
(13,CompactBuffer((49,2643), (56,2553), (60,2539), (55,2553), (45,2762), (53,2699), (46,2519), (60,2719), (56,2760), (51,2731), (57,2671), (47,2604), (58,2510), (51,2649), (56,2559), (59,2604), (47,2613), (49,2585), (58,2749), (50,2756), (57,2596), (54,2517), (48,2554), (47,2576), (58,2528), (60,2765), (54,2689), (51,2739), (51,2698), (47,2739), (51,2546), (54,2647), (60,2504), (48,2536), (51,2602), (47,2651), (53,2545), (48,2665), (55,2670), (60,2524), (48,2612), (60,2712), (60,2583), (47,2773), (57,2589), (51,2512), (57,2607), (57,2576), (53,2604), (59,2702), (51,2687), (10,100)))
Could you suggest some useful Scala functions to sum each of the two elements? Thanks!
If I understand your question correctly, here's one approach using groupByKey, mapValues and reduce to aggregate hours and miles:
val rdd = sc.parallelize(Seq(
(13, (49,2643)),
(13, (56,2553)),
(13, (60,2539)),
(14, (40,1500)),
(14, (50,2500))
))
// Sum hours and miles pairwise within each driver's group, then collect to the driver
rdd.groupByKey.mapValues( _.reduce( (a, x) => (a._1 + x._1, a._2 + x._2) ) ).collect
// res1: Array[(Int, (Int, Int))] = Array((13,(165,7735)), (14,(90,4000)))
Or, as pointed out by commenters, aggregate using reduceByKey directly if you don't need the intermediary result from groupByKey:
rdd.reduceByKey( (a, x) => (a._1 + x._1, a._2 + x._2) )
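To sanity-check the combining function without a Spark context, the same pairwise sum can be run on a plain Scala collection. This is only a minimal sketch: the names SumPairsSketch, sumPair, records and totals are made up for illustration, and the data is the small sample from above rather than your full RDD.

```scala
// Spark-free sketch: apply the same pairwise sum to a local Seq.
// All names here are illustrative and do not come from the original post.
object SumPairsSketch {
  // Adds two (hours, miles) pairs component-wise.
  def sumPair(a: (Int, Int), x: (Int, Int)): (Int, Int) =
    (a._1 + x._1, a._2 + x._2)

  // Same sample records as the RDD example: (driverId, (hours, miles)).
  val records: Seq[(Int, (Int, Int))] = Seq(
    (13, (49, 2643)),
    (13, (56, 2553)),
    (13, (60, 2539)),
    (14, (40, 1500)),
    (14, (50, 2500))
  )

  // Group by driver id, then reduce each group's pairs with sumPair.
  val totals: Map[Int, (Int, Int)] =
    records.groupBy(_._1).map { case (driver, rows) =>
      driver -> rows.map(_._2).reduce(sumPair)
    }

  def main(args: Array[String]): Unit =
    totals.toSeq.sortBy(_._1).foreach(println)
}
```

Because sumPair is associative and commutative, the same function should be safe to pass to reduceByKey, e.g. rdd.reduceByKey(SumPairsSketch.sumPair).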