
Spark: how to zip an RDD with each partition of the other RDD

Let's say I have one RDD[U] that will always consist of only one partition. My task is to fill this RDD with the contents of another RDD[T] that resides over n partitions. The final output should be an RDD[U] with n partitions.

What I tried to do originally is:

val newRDD = firstRDD.zip(secondRDD).map { case (a, b) => a.insert(b) }

But I got an error: Can't zip RDDs with unequal numbers of partitions

I can see in the RDD API documentation that there is a method called zipPartitions(). Is it possible, and if so how, to use this method to zip each partition of RDD[T] with the single partition of RDD[U] and perform a map on it as I tried above?
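For reference, zipPartitions has the same constraint as zip: all zipped RDDs must have the same number of partitions, so calling it directly on these two RDDs raises the same error. A minimal sketch of the call shape, reusing the firstRDD, secondRDD, and insert from above:

// Sketch only: this still throws "Can't zip RDDs with unequal numbers
// of partitions" here, because zipPartitions also requires matching
// partition counts; it is shown just to illustrate the API shape.
val attempted = secondRDD.zipPartitions(firstRDD) { (tIter, uIter) =>
  uIter.zip(tIter).map { case (u, t) => u.insert(t) }
}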

Something like this should work:

// Key both RDDs by element index so matching elements can be aligned with a join
val zippedFirstRDD = firstRDD.zipWithIndex.map(_.swap)
val zippedSecondRDD = secondRDD.zipWithIndex.map(_.swap)

// Join on the index, then combine each aligned (U, T) pair
zippedFirstRDD.join(zippedSecondRDD)
  .map { case (_, (valueU, valueT)) => valueU.insert(valueT) }
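A self-contained usage sketch of that answer (the Container type, its insert method, and the sample data are hypothetical stand-ins, chosen only to make the snippet runnable):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical element type standing in for U; insert mirrors the
// a.insert(b) call from the question.
case class Container(items: List[Int]) {
  def insert(x: Int): Container = Container(x :: items)
}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("zip-demo"))

val firstRDD  = sc.parallelize(Seq(Container(Nil), Container(Nil)), numSlices = 1) // 1 partition
val secondRDD = sc.parallelize(Seq(10, 20), numSlices = 2)                         // n = 2 partitions

val result = firstRDD.zipWithIndex.map(_.swap)
  .join(secondRDD.zipWithIndex.map(_.swap))
  .map { case (_, (u, t)) => u.insert(t) }

result.collect().foreach(println) // Container(List(10)), Container(List(20)) in some order

Note that the join decides the output partitioning, so if the result must end up with exactly n partitions, a trailing repartition(n), or an explicit partitioner passed to join, may still be needed.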
