
Finding Maximum in Key Value RDD

I have a key-value RDD of the form:

(Some(23661587),
CompactBuffer(Posting(2,23661643,Some(23661587),0,None), 
              Posting(2,23661682,Some(23661587),0,None)))

Here Some(23661587) is the key and the data inside the CompactBuffer is the value. For each key, I want to select the Posting that has the maximum value of a particular attribute.

How can I do that? I have limited experience in Scala and Spark. Thanks

I reproduced your example with some data.

As @sinanspd said, org.apache.spark.util.collection.CompactBuffer extends Seq (see the CompactBuffer source), so you can use the standard Seq methods to sort the values of each key and take the Posting with the maximum value.

I chose to sort the Seq by Posting.value, but it could be value2 or any other field of the Posting class.
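Sorting is only a means of reaching the largest element, so the same selection can be made in one pass with maxBy. The following is a minimal sketch on a plain Seq, with no Spark involved; it reuses the Posting fields from this answer, while the object name MaxByField and its helper methods are hypothetical names introduced here for illustration.

```scala
// Same fields as the Posting case class used in this answer.
case class Posting(key: Int, value: Long, value2: Option[Long], value3: Int, value4: Option[Int])

// Hypothetical helper object, introduced only for this sketch.
object MaxByField {
  // Largest Posting by `value`, found in a single pass without sorting.
  // Throws on an empty Seq, just like sortBy(...).last does.
  def maxByValue(postings: Seq[Posting]): Posting =
    postings.maxBy(_.value)

  // Same idea for an Option field: the default Ordering places None below any Some.
  def maxByValue2(postings: Seq[Posting]): Posting =
    postings.maxBy(_.value2)

  // Safe variant: returns None for an empty Seq instead of throwing.
  def maxByValueOption(postings: Seq[Posting]): Option[Posting] =
    postings.reduceOption((a, b) => if (a.value >= b.value) a else b)
}
```

Swapping the field inside maxBy is all it takes to rank by a different attribute. The reduceOption variant is written that way so it also compiles on Scala 2.12, which many Spark builds still use.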

As a complete example:

import org.apache.spark.sql.SparkSession

object FindingMaximum {

  val spark = SparkSession
    .builder()
    .appName("FindingMaximum")
    .master("local[*]")
    .getOrCreate()

  val sc = spark.sparkContext

  case class Posting(key: Int, value: Long, value2: Option[Long], value3: Int, value4: Option[Int])

  val data = List((Some(23661587),Seq(Posting(2,23661643,Some(23661587),0,None), Posting(2,23661682,Some(23661587),0,None))),
                  (Some(23661588),Seq(Posting(3,23661743,Some(23661588),0,None), Posting(3,23661682,Some(23661588),0,None))),
                  (Some(23661589),Seq(Posting(4,23661843,Some(23661589),0,None), Posting(4,23661882,Some(23661589),0,None))))

  def main(args: Array[String]): Unit = {

    sc.setLogLevel("ERROR")

    val rdd = sc.parallelize(data)

    // For each key, sort its postings by `value` and keep the last (largest) one
    val rddKeyMax = rdd.map { case (key, postings) =>
      val max = postings.sortBy(posting => posting.value).last
      (key, max)
    }
    rddKeyMax.foreach(println)
  }
}

/* Output (line order may vary between runs):
(Some(23661588),Posting(3,23661743,Some(23661588),0,None))
(Some(23661587),Posting(2,23661682,Some(23661587),0,None))
(Some(23661589),Posting(4,23661882,Some(23661589),0,None))
*/
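If the grouped RDD in the question was produced by a groupByKey over flat (key, Posting) pairs, which is an assumption about the asker's pipeline, the maximum can also be computed with reduceByKey so that no group is ever materialized or sorted. The sketch below shows that variant next to a mapValues version for data that is already grouped. The names MaxPerKey, sample, maxPerKey and maxPerGroup are hypothetical and not part of the original answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical object for this sketch; Posting has the same fields as above.
object MaxPerKey {
  case class Posting(key: Int, value: Long, value2: Option[Long], value3: Int, value4: Option[Int])

  // Flat (key, posting) pairs: the assumed shape of the data before grouping.
  val sample: Seq[(Option[Int], Posting)] = Seq(
    (Some(23661587), Posting(2, 23661643L, Some(23661587L), 0, None)),
    (Some(23661587), Posting(2, 23661682L, Some(23661587L), 0, None)),
    (Some(23661588), Posting(3, 23661743L, Some(23661588L), 0, None)),
    (Some(23661588), Posting(3, 23661682L, Some(23661588L), 0, None))
  )

  // Keeps the larger Posting of every pair that shares a key.
  def maxPerKey(pairs: RDD[(Option[Int], Posting)]): RDD[(Option[Int], Posting)] =
    pairs.reduceByKey((a, b) => if (a.value >= b.value) a else b)

  // For an RDD that is already grouped: pick the maximum of each group without sorting.
  def maxPerGroup(grouped: RDD[(Option[Int], Iterable[Posting])]): RDD[(Option[Int], Posting)] =
    grouped.mapValues(_.maxBy(_.value))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MaxPerKey").master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val pairs = spark.sparkContext.parallelize(sample)
    maxPerKey(pairs).collect().foreach(println)
    maxPerGroup(pairs.groupByKey()).collect().foreach(println)

    spark.stop()
  }
}
```

Both functions should select the same Posting per key as the sortBy(...).last version whenever the values within a key are distinct. With ties, which of the equal postings is returned may differ.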
