
Is this a suitable way to implement a lazy `take` on RDD?

It's quite unfortunate that `take` on RDD is a strict operation rather than a lazy one, but I won't get into why I think that's a regrettable design here and now.

My question is whether this is a suitable implementation of a lazy `take` for RDD. It seems to work, but I might be missing some non-obvious problem with it.

import org.apache.spark.{OneToOneDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

def takeRDD[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  new RDD[T](rdd.context, List(new OneToOneDependency(rdd))) {
    // An unfortunate consequence of the way the RDD AST is designed:
    // track how many elements have been produced so far.
    var doneSoFar = 0L

    def isDone = doneSoFar >= num

    override def getPartitions: Array[Partition] = rdd.partitions

    // Should I do this? Doesn't look like I need to
    // override val partitioner = rdd.partitioner

    override def compute(split: Partition, ctx: TaskContext): Iterator[T] = new Iterator[T] {
      val inner = rdd.compute(split, ctx)

      override def hasNext: Boolean = !isDone && inner.hasNext

      override def next: T = {
        doneSoFar += 1
        inner.next
      }
    }
  }

Answer to your question

No, this doesn't work. There's no way to have a variable that can be seen and updated concurrently across a Spark cluster, and that's exactly what you're trying to use doneSoFar as. If you try this, then when you run compute (in parallel across many nodes), you will:

a) serialize the takeRDD in the task, because you reference the class variable doneSoFar. This means that you write the class to bytes and make a new instance in each JVM (executor).

b) update doneSoFar in compute, which updates the local instance on each executor JVM. You'll take a number of elements from each partition equal to num.
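For illustration, here is a minimal sketch (my own, not from the original answer) of the same pitfall, using a plain local counter: in cluster mode the driver-side variable never sees the executor-side increments.

import org.apache.spark.{SparkConf, SparkContext}

object ClosureCounterDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo"))

    var counter = 0L // driver-side variable captured by the closure below

    // Each task deserializes its own copy of `counter`; the increments
    // happen on executor JVMs and are never shipped back to the driver.
    sc.parallelize(1 to 100).foreach { _ => counter += 1 }

    // In cluster mode this prints "counter = 0" -- exactly the problem
    // with doneSoFar above.
    println(s"counter = $counter")

    sc.stop()
  }
}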

It's possible this will work in Spark local mode due to some of the JVM properties there, but it CERTAINLY will not work when running Spark in cluster mode.

Why take is an action, not a transformation

RDDs are distributed, so subsetting to an exact number of elements is an inefficient operation -- it can't be done totally in parallel, since each shard needs information about the other shards (like whether it should be computed at all). take is great for bringing distributed data back into local memory.
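To make that cross-shard dependency concrete, here is a hedged sketch (my illustration, not part of the original answer) of a lazy, exact take built on zipWithIndex. Note that zipWithIndex itself triggers a Spark job to learn the per-partition element counts when the RDD has more than one partition -- exactly the information exchange described above, so the result is lazy but not free.

import org.apache.spark.rdd.RDD

// A sketch only: zipWithIndex must run a job to compute partition offsets,
// after which the filter can be applied to each partition independently.
def lazyTake[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  rdd.zipWithIndex()
     .filter { case (_, idx) => idx < num }
     .map(_._1)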

rdd.sample is a similar operation that stays in the distributed world, and can be run in parallel easily.
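For example (a small usage sketch, assuming a SparkContext sc is in scope):

// `sample` is a lazy transformation: it takes a fraction rather than an
// exact count, so each partition can decide independently what to keep.
val rdd = sc.parallelize(1 to 1000000)
val roughlyOnePercent = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)

// Nothing is computed until an action runs:
println(roughlyOnePercent.count())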
