
How to combine two RDDs into one RDD in Spark (Python)

For example, suppose there are two RDDs, such as rdd1 = [[1,2],[3,4]] and rdd2 = [[5,6],[7,8]]. How can both be combined into this style: [[1,2,5,6],[3,4,7,8]]? Is there a function that can solve this problem?

You basically need to combine your RDDs using rdd.zip() and perform a map operation on the resulting RDD to get your desired output:

rdd1 = sc.parallelize([[1,2],[3,4]])
rdd2 = sc.parallelize([[5,6],[7,8]])

# Zip the two RDDs together element-by-element
rdd_temp = rdd1.zip(rdd2)

# Perform a map operation to get your desired output by flattening each element
# Reference : https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
rdd_final = rdd_temp.map(lambda x: [item for sublist in x for item in sublist])

#rdd_final.collect()
#Output : [[1, 2, 5, 6], [3, 4, 7, 8]]
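The zip-and-flatten pipeline above can be sketched in plain Python (no Spark cluster needed) to show exactly what each step computes on the same example data:

```python
# Plain-Python sketch of the zip-and-flatten logic from the Spark example above.
rdd1_data = [[1, 2], [3, 4]]
rdd2_data = [[5, 6], [7, 8]]

# zip pairs elements positionally: ([1, 2], [5, 6]) and ([3, 4], [7, 8]),
# which is what rdd1.zip(rdd2) produces per partition element
zipped = list(zip(rdd1_data, rdd2_data))

# flatten each pair of lists into one list, as the map(lambda ...) step does
result = [[item for sublist in pair for item in sublist] for pair in zipped]
print(result)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

Note that RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition; the plain-Python zip has no such requirement, but the per-element flattening logic is identical.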

You can also check out the results in the Databricks notebook at this link.

Another (longer) way to achieve this uses an RDD join:

rdd1 = sc.parallelize([[1,2],[3,4]])
rdd2 = sc.parallelize([[5,6],[7,8]])

# create keys for the join by attaching each element's positional index,
# then swapping so the index becomes the key
# (tuple-parameter unpacking in lambdas was removed in Python 3, so index
# into the tuple instead of writing lambda (val, key): ...)
rdd1 = rdd1.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
rdd2 = rdd2.zipWithIndex().map(lambda pair: (pair[1], pair[0]))
# join on the shared key and flatten the output
rdd_joined = rdd1.join(rdd2).map(lambda kv: kv[1][0] + kv[1][1])

rdd_joined.take(2)
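The keyed-join approach can likewise be sketched in plain Python to make the data flow concrete (the dict lookups stand in for Spark's shuffle-based join; variable names are illustrative):

```python
# Plain-Python sketch of the zipWithIndex + join approach (no Spark needed).
rdd1_data = [[1, 2], [3, 4]]
rdd2_data = [[5, 6], [7, 8]]

# zipWithIndex + swap: the positional index becomes the join key
keyed1 = {key: val for key, val in enumerate(rdd1_data)}
keyed2 = {key: val for key, val in enumerate(rdd2_data)}

# join on the shared keys and concatenate the two value lists
joined = [keyed1[k] + keyed2[k] for k in sorted(keyed1.keys() & keyed2.keys())]
print(joined)  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

One caveat the sketch hides: a Spark join does not guarantee output order, so the zip approach is both shorter and order-preserving, while the join version may need an extra sort if element order matters.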
