How to get a running sum based on two columns using a Spark Scala RDD
How to get the sum of two elements from a Spark RDD Iterable
I am trying to write a simple program to find the total hours logged and miles driven per driver. I have applied groupByKey to the RDD and now have:
(13,CompactBuffer((49,2643), (56,2553), (60,2539), (55,2553), (45,2762), (53,2699), (46,2519), (60,2719), (56,2760), (51,2731), (57,2671), (47,2604), (58,2510), (51,2649), (56,2559), (59,2604), (47,2613), (49,2585), (58,2749), (50,2756), (57,2596), (54,2517), (48,2554), (47,2576), (58,2528), (60,2765), (54,2689), (51,2739), (51,2698), (47,2739), (51,2546), (54,2647), (60,2504), (48,2536), (51,2602), (47,2651), (53,2545), (48,2665), (55,2670), (60,2524), (48,2612), (60,2712), (60,2583), (47,2773), (57,2589), (51,2512), (57,2607), (57,2576), (53,2604), (59,2702), (51,2687), (10,100)))
Could you suggest some useful Scala functions to get the sum of the two elements? Thanks!
If I understand your question correctly, here is one way to total the hours and miles using groupByKey, mapValues, and reduce:
// sample (driverId, (hours, miles)) records
val rdd = sc.parallelize(Seq(
(13, (49,2643)),
(13, (56,2553)),
(13, (60,2539)),
(14, (40,1500)),
(14, (50,2500))
))
rdd.groupByKey.mapValues( _.reduce( (a, x) => (a._1 + x._1, a._2 + x._2) ) ).collect
// res1: Array[(Int, (Int, Int))] = Array((13,(165,7735)), (14,(90,4000)))
Or, as a commenter pointed out, if the intermediate grouped result from groupByKey is not needed, aggregate directly with reduceByKey:
rdd.reduceByKey( (a, x) => (a._1 + x._1, a._2 + x._2) )
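The pairwise-sum logic itself can be checked without a Spark cluster, since Scala's own collections expose the same groupBy/map/reduce pattern as the RDD API. A minimal sketch (the object name `SumPairs` is just for illustration, not part of Spark):

```scala
object SumPairs {
  // sample (driverId, (hours, miles)) records, as in the RDD example above
  val data: Seq[(Int, (Int, Int))] = Seq(
    (13, (49, 2643)), (13, (56, 2553)), (13, (60, 2539)),
    (14, (40, 1500)), (14, (50, 2500))
  )

  // group by driver id, then sum hours and miles within each group,
  // mirroring groupByKey + mapValues + reduce on an RDD
  val totals: Map[Int, (Int, Int)] =
    data.groupBy(_._1).map { case (id, recs) =>
      id -> recs.map(_._2).reduce((a, x) => (a._1 + x._1, a._2 + x._2))
    }

  def main(args: Array[String]): Unit =
    totals.toSeq.sortBy(_._1).foreach(println)
    // (13,(165,7735))
    // (14,(90,4000))
}
```

The same `(a, x) => (a._1 + x._1, a._2 + x._2)` function works unchanged as the argument to `reduceByKey` on the real RDD.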
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site or the original source.