How to get new RDD from PairRDD based on Key
In my Spark application, I am using a JavaPairRDD<Integer, List<Tuple3<String, String, String>>> which holds a large amount of data.
My requirement is to derive several other RDDs of type JavaRDD<Tuple3<String, String, String>> from that large PairRDD, one per key.
I don't know the Java API, but here's how you would do it in Scala (in spark-shell):
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def rddByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, Seq[V])]) = {
  rdd.keys.distinct.collect.map {
    key => key -> rdd.filter(_._1 == key).values.flatMap(identity)
  }
}
You have to filter for each key and flatten the Lists with flatMap.
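To illustrate the same per-key filter-and-flatten idea without a Spark dependency, here is a minimal sketch in plain Java collections (the data and the SplitByKey class name are made up for the example; in real code the grouping would run on the RDD as shown above):

```java
import java.util.*;

public class SplitByKey {
    // Collect all values for each distinct key and flatten the per-pair Lists,
    // mirroring filter-per-key + flatMap from the Scala snippet.
    public static Map<Integer, List<String>> splitByKey(
            List<Map.Entry<Integer, List<String>>> pairs) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> pair : pairs) {
            result.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .addAll(pair.getValue()); // flatten each List into its key's bucket
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the PairRDD's contents
        List<Map.Entry<Integer, List<String>>> pairs = Arrays.asList(
            new AbstractMap.SimpleEntry<>(1, Arrays.asList("a", "b")),
            new AbstractMap.SimpleEntry<>(2, Arrays.asList("c")),
            new AbstractMap.SimpleEntry<>(1, Arrays.asList("d")));
        System.out.println(splitByKey(pairs)); // prints {1=[a, b, d], 2=[c]}
    }
}
```

Note that on actual RDDs each `filter` pass rescans the whole dataset, which is why the answer below questions whether this operation is worthwhile.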
I have to mention that this is not a useful operation. If you were able to build the original RDD, that means each List is small enough to fit into memory. So I don't see why you would want to make them into RDDs.