
How to get new RDD from PairRDD based on Key

In my Spark application, I am using a JavaPairRDD<Integer, List<Tuple3<String, String, String>>> that holds a large amount of data.

My requirement is to derive, for a given key, a separate JavaRDD<Tuple3<String, String, String>> from that large PairRDD.

I don't know the Java API, but here's how you would do it in Scala (in spark-shell):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// For each distinct key, build an RDD of that key's flattened values.
def rddByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, Seq[V])]): Array[(K, RDD[V])] = {
  rdd.keys.distinct.collect.map { key =>
    key -> rdd.filter(_._1 == key).values.flatMap(identity)
  }
}
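To see the shape of the logic without a running Spark cluster, here is a local-collection sketch that mirrors the same per-key filter-and-flatten pattern (the sample data is hypothetical, and plain Seq stands in for RDD so it runs without a SparkContext):

```scala
// Hypothetical sample pairs, standing in for the PairRDD's contents.
val pairs: Seq[(Int, Seq[String])] = Seq(
  1 -> Seq("a", "b"),
  2 -> Seq("c"),
  1 -> Seq("d")
)

// For each distinct key, filter the pairs and flatten the value lists,
// mirroring rdd.filter(_._1 == key).values.flatMap(identity) above.
val byKey: Map[Int, Seq[String]] =
  pairs.map(_._1).distinct.map { key =>
    key -> pairs.filter(_._1 == key).flatMap(_._2)
  }.toMap
```

Note that the real rddByKey makes one full pass over the RDD per distinct key, which is why it collects the keys to the driver first.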

You have to filter for each key and flatten the Lists with flatMap.

I have to mention that this is not a useful operation. If you were able to build the original RDD, that means each List is small enough to fit into memory, so I don't see why you would want to turn them into RDDs.
