
Finding the max value in Spark RDD

From the following, how can I get the tuple with the highest value?

Array[(String, Int)] = Array((a,30),(b,50),(c,20))

In this example the result I want would be (b,50).

You could use reduce():

val max_tuple = rdd.reduce((acc, value) => {
  if (acc._2 < value._2) value else acc
})
// max_tuple: (String, Int) = (b,50)

Data

val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))

If the elements are always two-element tuples you could simply:

Array(("a",30),("b",50),("c",20)).maxBy(_._2)

As specified in the docs.
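
For completeness, maxBy also works if the data is still in an RDD and you collect it first (a minimal sketch, assuming the rdd from the answer above; only reasonable for small data, since collect brings everything to the driver):

// Collect the RDD into a local Array, then pick the tuple with the largest value
rdd.collect().maxBy(_._2)
// (String, Int) = (b,50)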

If you are new to Spark, I should tell you that you should use DataFrames as much as possible; they have a lot of advantages over RDDs. With DataFrames you can get the max like this:

import spark.implicits._
import org.apache.spark.sql.functions.max
val df = Seq(("a",30),("b",50),("c",20)).toDF("x", "y")
val x = df.sort($"y".desc).first()

Disclaimer: as @Mandy007 noted in the comments, this solution is more computationally expensive because it has to sort the data.

This should work; at least it works for me. Hope this helps you.
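
If the cost of the full sort is a concern, a cheaper sketch (assuming the same df as above) is a single aggregation; this relies on max over a struct comparing by its first field:

import org.apache.spark.sql.functions.{max, struct}
// max on a struct orders by the first field (y), so this returns the
// (y, x) pair with the largest y without sorting the whole DataFrame
val maxRow = df.agg(max(struct($"y", $"x"))).first()
// maxRow: org.apache.spark.sql.Row = [[50,b]]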

reduce() returns the wrong result for me. There are some other options:

val maxTemp2 = rdd.max()(Ordering[Int].on(x => x._2))
// sortBy is ascending by default, so sort descending to put the max first
val maxTemp3 = rdd.sortBy[Int](x => x._2, ascending = false).take(1)(0)
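
Another option worth noting (a sketch, assuming the same rdd as below): takeOrdered with a reversed ordering returns only the top element to the driver, without materializing a full sort of the RDD:

// Take the single largest tuple by the second element
val maxTemp4 = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2)).head
// maxTemp4: (String, Int) = (b,50)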

Data

val rdd = sc.parallelize(Array(("a",30),("b",50),("c",20)))

We can use maxBy on the collected result like this:

rdd.reduceByKey((a, b) => a + b).collect.maxBy(_._2)
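
Note that collect pulls the whole RDD to the driver; to keep the computation distributed, the same idea can be combined with the RDD max shown earlier (a sketch, assuming the same rdd):

rdd.reduceByKey((a, b) => a + b).max()(Ordering[Int].on(_._2))
// (String, Int) = (b,50)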
