How can I perform a zip-like operation on two RDDs that have a different number of elements in each partition?

Question

I am using Spark 1.1.0.

I have two RDDs, firstSample and secondSample, both of type JavaRDD<IndividualBean>. Their contents look like this:

[
IndividualBean [params={...}], 
IndividualBean [params={...}], 
IndividualBean [params={...}]
]

[
IndividualBean [params={...}], 
IndividualBean [params={...}], 
IndividualBean [params={...}]
]

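When I try to zip them together, I get the following error:

Can only zip RDDs with same number of elements in each partition

I guess this is because my RDDs do not have the same number of partitions, or do not have the same number of elements in each partition.

I would like to perform an operation on these RDDs that gives me the same result as zip.

For now, I have found the following solution (the totalSize variable is just the size of firstSample.union(secondSample)):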

// Index every element of the union, then key each element by its index.
JavaPairRDD<IndividualBean, IndividualBean> zipped = firstSample.union(secondSample).zipWithIndex().mapToPair(
    new PairFunction<Tuple2<IndividualBean, Long>, Long, IndividualBean>() {
        @Override
        public Tuple2<Long, IndividualBean> call(
                Tuple2<IndividualBean, Long> tuple) throws Exception {
            return new Tuple2<Long, IndividualBean>(tuple._2, tuple._1);
        }
    }).groupBy(new Function<Tuple2<Long, IndividualBean>, Long>() {
        @Override
        public Long call(Tuple2<Long, IndividualBean> tuple) throws Exception {
            // Shift first-half indices so that i and i + totalSize/2 share a key.
            long index = tuple._1.longValue();
            if (index < totalSize / 2) {
                return index + totalSize / 2;
            }
            return index;
        }
    }).values().mapToPair(new PairFunction<Iterable<Tuple2<Long, IndividualBean>>, IndividualBean, IndividualBean>() {
        @Override
        public Tuple2<IndividualBean, IndividualBean> call(
                Iterable<Tuple2<Long, IndividualBean>> iterable) throws Exception {
            // Each group is expected to hold exactly two elements.
            Iterator<Tuple2<Long, IndividualBean>> it = iterable.iterator();
            IndividualBean firstBean = it.next()._2;
            IndividualBean secondBean = it.next()._2;
            return new Tuple2<IndividualBean, IndividualBean>(firstBean, secondBean);
        }
    });
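
To see which elements the grouping key pairs up, the arithmetic can be tried outside Spark. The following is only a sketch in plain Java: GroupKeyDemo, groupKey and groupIndices are hypothetical names, and the sizes (3 elements per sample, so totalSize = 6) are assumed for illustration. It also assumes that zipWithIndex numbers all of firstSample before secondSample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupKeyDemo {
    // Same arithmetic as the groupBy function above:
    // indices in the first half are shifted up by totalSize / 2.
    static long groupKey(long index, long totalSize) {
        long half = totalSize / 2;
        return index < half ? index + half : index;
    }

    // Groups the indices 0 .. totalSize - 1 by their key (sorted by key).
    static Map<Long, List<Long>> groupIndices(long totalSize) {
        Map<Long, List<Long>> groups = new TreeMap<>();
        for (long i = 0; i < totalSize; i++) {
            groups.computeIfAbsent(groupKey(i, totalSize), k -> new ArrayList<>()).add(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed sizes: 3 elements in each sample, so totalSize = 6.
        System.out.println(groupIndices(6)); // {3=[0, 3], 4=[1, 4], 5=[2, 5]}
    }
}
```

Index i ends up in the same group as index i + totalSize/2, so the pairing only matches zip when both samples have the same size. With an odd totalSize, one group holds a single index, and the second it.next() call in the code above would then fail.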

However, this solution is expensive because it involves a shuffle.

Is there a better way to do this?

Answer 1

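A solution in Scala, since that is how I do all my Spark programming.

The key to this solution is to keep the same partitioning scheme throughout, and then zip the individual partitions together. To achieve this, the solution plays fast and loose with the sampling. In particular, the data point paired with each randomly chosen point is:

chosen from the same partition
not chosen randomly (in fact it tends to be one that came shortly after it in the original RDD)

The first of these simplifications is essential to the solution. The second can be removed by adding some code to the zipFunc defined below to reorder one side of the zip.

It is important to understand what zipFunc does: I am zipping the sample together with its complement, and the two are not even the same size. I simply zip the contents of the corresponding partitions of the two RDDs, even though they do not contain the same number of samples: when I run out of samples on one side of the zip, I just drop the ones left on the other side.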

import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD

// assumes an existing SparkContext named sc, as in spark-shell
val testRDD = sc.parallelize(1 to 1000, 4)

val firstSample = testRDD.sample(false, 0.4)
val remaining = testRDD.subtract(firstSample)

def zipFunc(l: Iterator[Int], r: Iterator[Int]): Iterator[(Int, Int)] = {
  val res = new ListBuffer[(Int, Int)]
  // exercise for the reader: suck either l or r into a container before iterating
  // and consume it in random order to achieve more random pairing if desired
  while (l.hasNext && r.hasNext) {
    res += ((l.next(), r.next()))
  }
  res.iterator
}
// notice the `true` to make sure partitioning is preserved
val pairs: RDD[(Int, Int)] = firstSample.zipPartitions(remaining, true)(zipFunc)

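As far as I can tell, this does not require any cross-partition communication. It does depend on sample() having drawn fairly evenly from across the partitions, and in my experience the sample() method is not bad in this respect.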
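
Since the question uses the Java API, here is a rough plain-Java sketch of the per-partition logic, without any Spark dependency. PartitionZip, zipTruncating and zipShuffled are hypothetical names introduced for illustration. zipTruncating mirrors what zipFunc does, and zipShuffled is one possible take on the "exercise for the reader": it buffers one side and shuffles it before pairing. In a real job this logic would presumably go inside the function passed to the Java API's zipPartitions, whose exact function interface depends on the Spark version.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Random;

class PartitionZip {
    // Pairs elements until either iterator runs out; leftovers are dropped.
    static <A, B> List<Map.Entry<A, B>> zipTruncating(Iterator<A> left, Iterator<B> right) {
        List<Map.Entry<A, B>> pairs = new ArrayList<>();
        while (left.hasNext() && right.hasNext()) {
            pairs.add(new SimpleEntry<A, B>(left.next(), right.next()));
        }
        return pairs;
    }

    // Variant: buffer the right side and shuffle it first, so that partners
    // are not simply taken in the partition's original order.
    static <A, B> List<Map.Entry<A, B>> zipShuffled(Iterator<A> left, Iterator<B> right, Random rnd) {
        List<B> buffered = new ArrayList<>();
        while (right.hasNext()) {
            buffered.add(right.next());
        }
        Collections.shuffle(buffered, rnd);
        return zipTruncating(left, buffered.iterator());
    }

    public static void main(String[] args) {
        // Made-up contents of one partition of the sample and of its complement.
        List<Integer> sample = Arrays.asList(1, 4);
        List<Integer> rest = Arrays.asList(2, 3, 5);
        System.out.println(zipTruncating(sample.iterator(), rest.iterator())); // [1=2, 4=3]
        System.out.println(zipShuffled(sample.iterator(), rest.iterator(), new Random(42)).size()); // 2
    }
}
```

Buffering one side costs memory proportional to that partition's size, which is the trade-off for more random pairing.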

Answer 1: 0 2014-09-29 15:36:25
