
How to find max and min simultaneously using aggregateByKey in Spark?

I have tried this code to find them, but I got an error:

val keysWithValuesList = Array("1=2000", "2=1800", "2=3000", "3=2500", "4=1500")
val data = sc.parallelize(keysWithValuesList,2)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))
val initialCount = kv.first._2
val maxi = (x: Int, y: Int) => if (x>y) x else y 
val mini = (x: Int, y: Int) => if (x>y) y else x 
val maxP = (p1: Int, p2: Int) => if (p1>p2) p1 else p2
val minP = (p1: Int, p2: Int) => if (p1>p2) p2 else p1
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))

The error is:

command-2654386024166474:13: error: type mismatch;
 found   : ((Int, Int) => Int, (Int, Int) => Int)
 required: (Int, Int) => Int
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))
                                              ^
command-2654386024166474:13: error: type mismatch;
 found   : ((Int, Int) => Int, (Int, Int) => Int)
 required: (Int, Int) => Int
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))

Is there any other method? Please suggest one.

It's possible to do two reduce operations at a time, but you will need to use tuples: aggregateByKey expects a single seqOp function and a single combOp function, not a tuple of functions, which is why the compiler reports the type mismatch. First format your RDD to duplicate the value:

val rddMinMax = kv.map(x => (x._1, (x._2, x._2)))

Then use this function to reduce the min and max slots of each pair in a single pass:

val minAndMax = (l1: (Int, Int), l2: (Int, Int)) => (mini(l1._1, l2._1), maxi(l1._2, l2._2))
rddMinMax.reduceByKey(minAndMax).collect()
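Putting the pieces together, a minimal self-contained sketch of this approach might look as follows (it assumes a SparkContext named sc, as in the question, and restates the question's sample data and the maxi/mini helpers; the result in the trailing comment is for that sample data):

// All values are keyed to the constant 1, exactly as in the question.
val keysWithValuesList = Array("1=2000", "2=1800", "2=3000", "3=2500", "4=1500")
val data = sc.parallelize(keysWithValuesList, 2)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))

val maxi = (x: Int, y: Int) => if (x > y) x else y
val mini = (x: Int, y: Int) => if (x > y) y else x

// Duplicate each value into a (min, max) pair, then merge pairs per key.
val rddMinMax = kv.map(x => (x._1, (x._2, x._2)))
val minAndMax = (l1: (Int, Int), l2: (Int, Int)) => (mini(l1._1, l2._1), maxi(l1._2, l2._2))
rddMinMax.reduceByKey(minAndMax).collect()  // Array((1,(1500,3000)))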

I have found my solution:

val list = Array("1=2000", "2=1800", "2=500", "3=2500", "4=4500")
val data = sc.parallelize(list, 6)
// Create key-value pairs (every record is keyed to the constant 1)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))
// Seed both slots of the accumulator with the first value: (max, min)
val initialCount = (kv.first._2, kv.first._2)
// seqOp: fold one value into the (max, min) accumulator
val min_max = (x: (Int, Int), y: Int) => (if (x._1 > y) x._1 else y, if (x._2 > y) y else x._2)
// combOp: merge two partition-level (max, min) accumulators
val min_maxP = (p1: (Int, Int), p2: (Int, Int)) => (if (p1._1 > p2._1) p1._1 else p2._1, if (p1._2 > p2._2) p2._2 else p1._2)
val minimum = kv.aggregateByKey(initialCount)(min_max, min_maxP)
minimum.first._2  // (max, min)

The output is:

list: Array[String] = Array(1=2000, 2=1800, 2=500, 3=2500, 4=4500)
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[164] at parallelize at command-110260081440638:2
kv: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[166] at map at command-110260081440638:4
initialCount: (Int, Int) = (2000,2000)
min_max: ((Int, Int), Int) => (Int, Int) = <function2>
min_maxP: ((Int, Int), (Int, Int)) => (Int, Int) = <function2>
minimum: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = ShuffledRDD[167] at aggregateByKey at command-110260081440638:8
res29: (Int, Int) = (4500,500)
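A small variant worth noting (a sketch, not part of the original solution): seeding the accumulator with (Int.MinValue, Int.MaxValue) instead of kv.first avoids triggering an extra Spark action before the aggregation and stays correct even if a partition is empty. It assumes the kv RDD from the solution above:

// Hypothetical variant: a neutral zero value instead of kv.first.
// The max slot starts at Int.MinValue and the min slot at Int.MaxValue,
// so any real value replaces them on first contact.
val zero = (Int.MinValue, Int.MaxValue)
val seqOp = (acc: (Int, Int), v: Int) => (math.max(acc._1, v), math.min(acc._2, v))
val combOp = (a: (Int, Int), b: (Int, Int)) => (math.max(a._1, b._1), math.min(a._2, b._2))
val maxMin = kv.aggregateByKey(zero)(seqOp, combOp)
maxMin.first._2  // (4500, 500) for the list above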
