Finding the max value in Spark RDD
From the following, how can I get the tuple with the highest value?
Array[(String, Int)] = Array((a,30),(b,50),(c,20))
In this example the result I want would be (b,50).
You could use reduce():
val max_tuple = rdd.reduce((acc, value) =>
  if (acc._2 < value._2) value else acc
)
// max_tuple: (String, Int) = (b,50)
Data
val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))
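The comparison inside that reduce() is plain Scala, so you can sanity-check the logic locally on an ordinary Array before running it on a cluster. A minimal sketch with the same data, no Spark required:

```scala
// Same single-pass comparison as the RDD reduce, on a local Array.
val data = Array(("a", 30), ("b", 50), ("c", 20))

// Keep whichever tuple has the larger second element.
val maxTuple = data.reduce((acc, value) =>
  if (acc._2 < value._2) value else acc
)

println(maxTuple) // (b,50)
```

Because reduce() may combine partial results from different partitions in any order, the combining function must be associative and commutative; this max-by-value comparison is both, which is why it is safe on an RDD.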
If you are new to Spark, I should tell you that you should use DataFrames as much as possible; they have a lot of advantages compared with RDDs. With DataFrames you can get the max like this:
import spark.implicits._
import org.apache.spark.sql.functions.max
val df = Seq(("a",30),("b",50),("c",20)).toDF("x", "y")
val x = df.sort($"y".desc).first()
Disclaimer: as @Mandy007 noted in the comments, this solution is more computationally expensive because it has to sort the data first.
This should work; at least it works for me. Hope this helps.
reduce() returns the wrong result for me. There are some other options:
val maxTemp2 = rdd.max()(Ordering[Int].on(x=>x._2))
val maxTemp3 = rdd.sortBy[Int](x=>x._2).take(1)(0)
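Both options hinge on an Ordering over the second tuple element: rdd.max() takes it as an explicit second argument list, and sortBy derives one from the key function. The same Ordering works on plain Scala collections, which makes it easy to verify locally. A sketch without Spark:

```scala
val data = Array(("a", 30), ("b", 50), ("c", 20))

// Ordering[Int].on lifts an Int ordering to tuples by extracting
// the field to compare, just as in rdd.max()(Ordering[Int].on(_._2)).
val byValue: Ordering[(String, Int)] = Ordering[Int].on(_._2)

val maxTemp   = data.max(byValue)       // single pass over the data
val maxSorted = data.sortBy(_._2).last  // sort ascending, take the largest

println(maxTemp)   // (b,50)
println(maxSorted) // (b,50)
```

As with the DataFrame answer above, the sort-based variant does more work than a single-pass max, so prefer max with a custom Ordering when you only need one element.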
Data
val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))
We can use maxBy on the collected result like this:
rdd.reduceByKey((a,b) => a+b).collect.maxBy(_._2)
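Note that collect pulls the whole RDD back to the driver as a plain Array, so the maxBy here is ordinary Scala, not a distributed Spark operation. A local sketch of what runs on the driver after collect:

```scala
// After rdd.reduceByKey(...).collect, the driver holds Array[(String, Int)].
val collected = Array(("a", 30), ("b", 50), ("c", 20))

// maxBy picks the tuple whose second element is largest.
val winner = collected.maxBy(_._2)

println(winner) // (b,50)
```

This is fine for small results, but collect materializes everything in driver memory; for large datasets the rdd.max()(Ordering[Int].on(_._2)) approach above computes the maximum on the executors instead.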