
Is this a suitable way to implement a lazy `take` on RDD?

It's quite unfortunate that `take` on RDD is a strict operation rather than a lazy one, but I won't get into why I think that's a regrettable design here and now.

My question is whether this is a suitable implementation of a lazy `take` for RDD. It seems to work, but I might be missing some non-obvious problem with it.

import org.apache.spark.{OneToOneDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

def takeRDD[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  new RDD[T](rdd.context, List(new OneToOneDependency(rdd))) {
    // An unfortunate consequence of the way the RDD AST is designed:
    // track how many elements have been produced so far.
    var doneSoFar = 0L

    def isDone = doneSoFar >= num

    override def getPartitions: Array[Partition] = rdd.partitions

    // Should I do this? Doesn't look like I need to
    // override val partitioner = rdd.partitioner

    override def compute(split: Partition, ctx: TaskContext): Iterator[T] = new Iterator[T] {
      val inner = rdd.compute(split, ctx)

      override def hasNext: Boolean = !isDone && inner.hasNext

      override def next: T = {
        doneSoFar += 1
        inner.next
      }
    }
  }

Answer to your question

No, this doesn't work. There's no way to have a variable that can be seen and updated concurrently across a Spark cluster, and that's exactly what you're trying to use doneSoFar as. If you try this, then when you run compute (in parallel across many nodes), you will:

a) serialize the takeRDD in the task, because you reference the class variable doneSoFar. This means that you write the class to bytes and make a new instance in each JVM (executor).

b) update doneSoFar in compute, which updates the local instance on each executor JVM. You'll take a number of elements from each partition equal to num.
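For illustration, here is a minimal sketch (my own, not from the original answer) of the same pitfall, using a plain local counter: in cluster mode the driver-side variable never sees the executor-side increments.

import org.apache.spark.{SparkConf, SparkContext}

object ClosureCounterDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo"))

    var counter = 0L // driver-side variable captured by the closure below

    // Each task deserializes its own copy of `counter`; the increments
    // happen on executor JVMs and are never shipped back to the driver.
    sc.parallelize(1 to 100).foreach { _ => counter += 1 }

    // In cluster mode this prints "counter = 0" -- exactly the problem
    // with doneSoFar above.
    println(s"counter = $counter")

    sc.stop()
  }
}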

It's possible this will work in Spark local mode due to some of the JVM properties there, but it CERTAINLY will not work when running Spark in cluster mode.

Why take is an action, not a transformation

RDDs are distributed, so subsetting to an exact number of elements is an inefficient operation -- it can't be done totally in parallel, since each shard needs information about the other shards (like whether it should be computed at all). take is great for bringing distributed data back into local memory.
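To make that cross-shard dependency concrete, here is a hedged sketch (my illustration, not part of the original answer) of a lazy, exact take built on zipWithIndex. Note that zipWithIndex itself triggers a Spark job to learn the per-partition element counts when the RDD has more than one partition -- exactly the information exchange described above, so the result is lazy but not free.

import org.apache.spark.rdd.RDD

// A sketch only: zipWithIndex must run a job to compute partition offsets,
// after which the filter can be applied to each partition independently.
def lazyTake[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  rdd.zipWithIndex()
     .filter { case (_, idx) => idx < num }
     .map(_._1)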

rdd.sample is a similar operation that stays in the distributed world, and can be run in parallel easily.
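For example (a small usage sketch, assuming a SparkContext sc is in scope):

// `sample` is a lazy transformation: it takes a fraction rather than an
// exact count, so each partition can decide independently what to keep.
val rdd = sc.parallelize(1 to 1000000)
val roughlyOnePercent = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)

// Nothing is computed until an action runs:
println(roughlyOnePercent.count())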
