How to keep an RDD persisted and consistent?

I have the following code (a simplification of a more complex situation):

val newRDD = prevRDD.flatMap { a =>
    // For each input element, emit a random number (0 to 9) of random bits (0 or 1)
    Array.fill[Int](scala.util.Random.nextInt(10)){scala.util.Random.nextInt(2)}
}.persist()
val a = newRDD.count
val b = newRDD.count

Even though the RDD is supposed to be persisted (and therefore consistent), a and b are not identical in most cases.

Is there a way to keep the results of the first action consistent, so that when a second action is called, it sees the same data as the first?

* Edit *

The issue I have is apparently caused by the zipWithIndex method in my code, which creates indices higher than the count. I'll ask about it in a different thread. Thanks.

There is no way to guarantee 100% consistency.

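When you call persist, Spark tries to cache all partitions in memory if they fit (the default storage level is MEMORY_ONLY). Otherwise, it recomputes the partitions that do not fit in memory. Because the function passed to flatMap is random, a recomputed partition can contain different elements, so two counts can differ.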
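
If the goal is for every action to see the same data, one option is to make the generation itself deterministic instead of relying on the cache. The following is a minimal sketch, not taken from the original code. It assumes a stand-in `prevRDD` built with `sc.parallelize`, a hypothetical fixed `baseSeed`, a local master, and that each partition yields its elements in the same order whenever it is recomputed (which may not hold after a shuffle). The object and method names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object DeterministicFlatMap {
  // Builds the RDD with a per-partition seeded generator and counts it twice.
  // Nothing is cached here, so the second count recomputes every partition.
  def countTwice(): (Long, Long) = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("deterministic-flatmap")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for prevRDD from the question (assumption).
      val prevRDD = sc.parallelize(1 to 1000, numSlices = 4)
      val baseSeed = 42L // hypothetical fixed seed

      val newRDD = prevRDD.mapPartitionsWithIndex { (partitionIndex, iter) =>
        // One generator per partition, seeded from the partition index, so a
        // recomputed partition replays the same random sequence.
        val rng = new Random(baseSeed + partitionIndex)
        iter.flatMap { _ =>
          Seq.fill(rng.nextInt(10))(rng.nextInt(2))
        }
      }

      val a = newRDD.count()
      val b = newRDD.count()
      (a, b)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (a, b) = countTwice()
    println(s"a = $a, b = $b")
  }
}
```

With this approach persist() becomes a performance optimization rather than a correctness requirement. Calling persist(StorageLevel.MEMORY_AND_DISK) instead of the default should reduce recomputation under memory pressure, but cached blocks can still be lost with an executor, so it does not by itself make a random computation repeatable.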
