
How to combine two RDDs into one RDD in Spark (Python)

For example, suppose there are two RDDs, such as rdd1 = [[1,2],[3,4]] and rdd2 = [[5,6],[7,8]]. How can both be combined into this style: [[1,2,5,6],[3,4,7,8]]? Is there a function that can solve this problem?

You basically need to combine your RDDs using rdd.zip() and perform a map operation on the resulting RDD to get your desired output:

rdd1 = sc.parallelize([[1,2],[3,4]])
rdd2 = sc.parallelize([[5,6],[7,8]])

# Zip the two RDDs together element-by-element
rdd_temp = rdd1.zip(rdd2)

# Perform a map operation to get your desired output by flattening each element
# Reference : https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
rdd_final = rdd_temp.map(lambda x: [item for sublist in x for item in sublist])

#rdd_final.collect()
#Output : [[1, 2, 5, 6], [3, 4, 7, 8]]
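The zip-and-flatten pipeline above can be sketched in plain Python (no Spark cluster needed) to show exactly what each step computes on the same example data:

```python
# Plain-Python sketch of the zip-and-flatten logic from the Spark example above.
rdd1_data = [[1, 2], [3, 4]]
rdd2_data = [[5, 6], [7, 8]]

# zip pairs elements positionally: ([1, 2], [5, 6]) and ([3, 4], [7, 8]),
# which is what rdd1.zip(rdd2) produces per partition element
zipped = list(zip(rdd1_data, rdd2_data))

# flatten each pair of lists into one list, as the map(lambda ...) step does
result = [[item for sublist in pair for item in sublist] for pair in zipped]
print(result)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

Note that RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition; the plain-Python zip has no such requirement, but the per-element flattening logic is identical.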

You can also check out the results in the Databricks notebook at this link.

Another (longer) way to achieve this uses an RDD join:

rdd1 = sc.parallelize([[1,2],[3,4]])
rdd2 = sc.parallelize([[5,6],[7,8]])

# create keys for the join by attaching each element's positional index,
# then swapping so the index becomes the key
# (tuple-parameter unpacking in lambdas was removed in Python 3, so index
# into the tuple instead of writing lambda (val, key): ...)
rdd1 = rdd1.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
rdd2 = rdd2.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
# join on the shared key and flatten the output
rdd_joined = rdd1.join(rdd2).map(lambda kv: kv[1][0] + kv[1][1])

rdd_joined.take(2)
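The keyed-join approach can likewise be sketched in plain Python to make the data flow concrete (the dict lookups stand in for Spark's shuffle-based join; variable names are illustrative):

```python
# Plain-Python sketch of the zipWithIndex + join approach (no Spark needed).
rdd1_data = [[1, 2], [3, 4]]
rdd2_data = [[5, 6], [7, 8]]

# zipWithIndex + swap: the positional index becomes the join key
keyed1 = {key: val for key, val in enumerate(rdd1_data)}
keyed2 = {key: val for key, val in enumerate(rdd2_data)}

# join on the shared keys and concatenate the two value lists
joined = [keyed1[k] + keyed2[k] for k in sorted(keyed1.keys() & keyed2.keys())]
print(joined)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

One caveat the sketch hides: a Spark join does not guarantee output order, so the zip approach is both shorter and order-preserving, while the join version may need an extra sort if element order matters.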
