
How to get new RDD from PairRDD based on Key

In my Spark application, I am using a JavaPairRDD<Integer, List<Tuple3<String, String, String>>> that holds a large amount of data.

My requirement is to derive, for a given key, a separate JavaRDD<Tuple3<String, String, String>> from that large PairRDD.

I don't know the Java API, but here's how you would do it in Scala (in spark-shell):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// For each distinct key, build an RDD of that key's flattened values.
def rddByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, Seq[V])]): Array[(K, RDD[V])] = {
  rdd.keys.distinct.collect.map { key =>
    key -> rdd.filter(_._1 == key).values.flatMap(identity)
  }
}
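To see the shape of the logic without a running Spark cluster, here is a local-collection sketch that mirrors the same per-key filter-and-flatten pattern (the sample data is hypothetical, and plain Seq stands in for RDD so it runs without a SparkContext):

```scala
// Hypothetical sample pairs, standing in for the PairRDD's contents.
val pairs: Seq[(Int, Seq[String])] = Seq(
  1 -> Seq("a", "b"),
  2 -> Seq("c"),
  1 -> Seq("d")
)

// For each distinct key, filter the pairs and flatten the value lists,
// mirroring rdd.filter(_._1 == key).values.flatMap(identity) above.
val byKey: Map[Int, Seq[String]] =
  pairs.map(_._1).distinct.map { key =>
    key -> pairs.filter(_._1 == key).flatMap(_._2)
  }.toMap
```

Note that the real rddByKey makes one full pass over the RDD per distinct key, which is why it collects the keys to the driver first.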

You have to filter for each key and flatten the Lists with flatMap.

I have to mention that this is not a useful operation. If you were able to build the original RDD, that means each List is small enough to fit into memory, so I don't see why you would want to turn them into RDDs.
