Is this a suitable way to implement a lazy `take` on RDD?
It's quite unfortunate that `take` on RDD is a strict operation instead of lazy, but I won't get into why I think that's a regrettable design here and now.
My question is whether this is a suitable implementation of a lazy `take` for RDD. It seems to work, but I might be missing some non-obvious problem with it.
```scala
def takeRDD[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  new RDD[T](rdd.context, List(new OneToOneDependency(rdd))) {
    // An unfortunate consequence of the way the RDD AST is designed
    var doneSoFar = 0L
    def isDone = doneSoFar >= num

    override def getPartitions: Array[Partition] = rdd.partitions

    // Should I do this? Doesn't look like I need to
    // override val partitioner = self.partitioner

    override def compute(split: Partition, ctx: TaskContext): Iterator[T] = new Iterator[T] {
      val inner = rdd.compute(split, ctx)
      override def hasNext: Boolean = !isDone && inner.hasNext
      override def next: T = {
        doneSoFar += 1
        inner.next
      }
    }
  }
```
No, this doesn't work. There's no way to have a variable which can be seen and updated concurrently across a Spark cluster, and that's exactly what you're trying to use `doneSoFar` as. If you try this, then when you run `compute` (in parallel across many nodes), you:
a) serialize the `takeRDD` in the task, because you reference the class variable `doneSoFar`. This means that you write the class to bytes and make a new instance in each JVM (executor);

b) update `doneSoFar` in `compute`, which updates the local instance on each executor JVM. You'll take a number of elements from each partition equal to `num`.
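To make the failure mode concrete, here is a rough local sketch (plain Scala, no Spark, names invented for illustration): each "executor" works on a fresh copy of the counter, just as deserializing the task would produce, so updates never propagate back to the driver and every partition counts from zero.

```scala
// Illustration only: simulates each executor deserializing its own
// copy of the RDD, and with it its own copy of the `doneSoFar` field.
class Counter extends Serializable {
  var doneSoFar = 0L // stands in for takeRDD's mutable field
}

val driver = new Counter                         // the instance on the driver
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6)) // two fake "partitions"

// Each "executor" gets a fresh copy, as deserialization would produce.
val perExecutorCounts = partitions.map { part =>
  val executorCopy = new Counter // the deserialized instance in that executor's JVM
  part.foreach(_ => executorCopy.doneSoFar += 1)
  executorCopy.doneSoFar
}

println(driver.doneSoFar)  // 0 -- the executors' updates never reach the driver
println(perExecutorCounts) // List(3, 3) -- each copy only saw its own partition
```

So the `isDone` check in the question's code only ever sees a per-JVM count, never a cluster-wide one.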
It's possible this will work in Spark local mode due to some of the JVM properties there, but it CERTAINLY will not work when running Spark in cluster mode.
`take` is an action, not a transformation. RDDs are distributed, and so subsetting to an exact number of elements is an inefficient operation -- it can't be done totally in parallel, since each shard needs information about the other shards (like whether it should be computed at all). `take` is great for bringing distributed data back into local memory.
`rdd.sample` is a similar operation that stays in the distributed world, and can be run in parallel easily.
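The reason `sample` parallelizes cleanly is that, without replacement, each element is kept independently with probability `fraction`, so a partition needs no information about any other. A minimal sketch of that per-partition logic (plain Scala, not Spark's actual sampler implementation; the real call would be something like `rdd.sample(withReplacement = false, fraction = 0.1)`):

```scala
import scala.util.Random

// Each element is kept independently with probability `fraction`;
// no cross-partition coordination is ever needed.
def samplePartition[T](partition: Seq[T], fraction: Double, seed: Long): Seq[T] = {
  val rng = new Random(seed)
  partition.filter(_ => rng.nextDouble() < fraction)
}

// Edge cases: fraction 1.0 keeps everything, 0.0 keeps nothing.
println(samplePartition(1 to 10, 1.0, seed = 7L).size) // 10
println(samplePartition(1 to 10, 0.0, seed = 7L).size) // 0
```

The trade-off versus a lazy `take` is that you get an approximate fraction of the data rather than an exact count.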