
How to get the sum of two elements from the Spark RDD Iterable

I am trying to write a simple program to find the sum of the hours and miles logged by each driver. I have applied groupByKey and now have the following RDD:

(13,CompactBuffer((49,2643), (56,2553), (60,2539), (55,2553), (45,2762), (53,2699), (46,2519), (60,2719), (56,2760), (51,2731), (57,2671), (47,2604), (58,2510), (51,2649), (56,2559), (59,2604), (47,2613), (49,2585), (58,2749), (50,2756), (57,2596), (54,2517), (48,2554), (47,2576), (58,2528), (60,2765), (54,2689), (51,2739), (51,2698), (47,2739), (51,2546), (54,2647), (60,2504), (48,2536), (51,2602), (47,2651), (53,2545), (48,2665), (55,2670), (60,2524), (48,2612), (60,2712), (60,2583), (47,2773), (57,2589), (51,2512), (57,2607), (57,2576), (53,2604), (59,2702), (51,2687), (10,100)))

Could you suggest some useful Scala functions to get the sum of the two elements? Thanks!

If I understand your question correctly, here is one way to total the hours and miles using groupByKey, mapValues, and reduce:

val rdd = sc.parallelize(Seq(
  (13, (49,2643)),
  (13, (56,2553)),
  (13, (60,2539)),
  (14, (40,1500)),
  (14, (50,2500))
))

rdd.groupByKey.mapValues( _.reduce( (a, x) => (a._1 + x._1, a._2 + x._2) ) ).collect
// res1: Array[(Int, (Int, Int))] = Array((13,(165,7735)), (14,(90,4000)))

Or, as a commenter pointed out, if the intermediate result of groupByKey is not needed, skip groupByKey and aggregate directly with reduceByKey:

rdd.reduceByKey( (a, x) => (a._1 + x._1, a._2 + x._2) ) 
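The same per-key aggregation can be sketched on plain Scala collections, with no Spark cluster needed, since `Seq` offers an analogous `groupBy`/`map`/`reduce` pipeline (the `SumPerDriver` object name and sample data are illustrative, not from the original question):

```scala
// Plain-Scala sketch of the (hours, miles) aggregation, mirroring
// the RDD's groupByKey + mapValues + reduce pipeline.
object SumPerDriver {
  def main(args: Array[String]): Unit = {
    // Sample (driverId, (hours, miles)) records, as in the answer above.
    val records = Seq(
      (13, (49, 2643)),
      (13, (56, 2553)),
      (13, (60, 2539)),
      (14, (40, 1500)),
      (14, (50, 2500))
    )

    // groupBy collects all (hours, miles) pairs per driver id,
    // then reduce sums the two tuple components independently.
    val totals = records
      .groupBy(_._1)
      .map { case (id, pairs) =>
        id -> pairs.map(_._2).reduce((a, x) => (a._1 + x._1, a._2 + x._2))
      }

    println(totals) // Map(14 -> (90,4000), 13 -> (165,7735))
  }
}
```

The reduction function here is identical to the one passed to reduceByKey above, which is why both the local sketch and the Spark versions produce the same totals.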

