
Combining 2 RDDs in python Spark

I have 2 RDDs. Assume rdd1 = {'a','b','c', 'a', 'c', 'a'} and rdd2 is the output of KMeans, with the cluster assignments rdd2 = {0,0,1,1,1,0}. I want to find out how many a's and b's there are in clusters 0 and 1. For example, cluster 0 has 2 a's, so I want something like {0, a, 2}, etc. Is there a way to combine these 2 RDDs to do such an operation?

Thanks for your help.

The below works, using tuples and a list instead of a set where appropriate.

rdd1 = sc.parallelize(['a', 'b', 'c', 'a', 'c', 'a'])
rdd2 = sc.parallelize([0, 0, 1, 1, 1, 0])
# Pair each element with its cluster, count each (element, cluster) pair,
# then flatten ((element, cluster), count) into (element, cluster, count).
rdd = rdd1.zip(rdd2).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y) \
         .map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
rdd.collect()

Output:

[('a', 0, 2), ('b', 0, 1), ('c', 1, 2), ('a', 1, 1)]
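
Note that zip assumes the two RDDs have the same number of partitions and the same number of elements in each partition, which holds here because the cluster assignments were derived from the input. If the final counts are small enough to fit on the driver, a minimal alternative sketch using countByValue, which returns the counts as a local dict rather than an RDD:

# countByValue collects the (element, cluster) counts to the driver as a dict.
counts = rdd1.zip(rdd2).countByValue()
# counts == {('a', 0): 2, ('b', 0): 1, ('c', 1): 2, ('a', 1): 1}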
