
Why does Spark keep recomputing an RDD?

I make an RDD using flatMap. Later on I perform range partitioning on it. If I persist the original RDD, everything works fine. However, if I don't cache it, the range partitioner part somehow wants to recalculate parts of the original RDD. I would understand this if I didn't have enough memory, but in this case my system has far more memory than the RDD occupies. Secondly, the computations for that RDD are long, so this restarting/recomputing really hurts performance. What could be the reason for this strange behavior?
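
A minimal sketch of this kind of pipeline (not my exact code; sc is the SparkContext, the input path and all names are placeholders, and the real flatMap is far more expensive):

// Sketch only: assumes a running SparkContext `sc`; names and path are placeholders.
val raw = sc.textFile("input.txt")

// Expensive flatMap producing key/value pairs (stands in for the long computation).
val pairs = raw.flatMap(line => line.split("\\s+").map(word => (word, 1L)))

// Without pairs.persist(), building the RangePartitioner samples `pairs` (one job),
// and the shuffle triggered by partitionBy then evaluates the same lineage again.
// pairs.persist()   // uncommenting this avoids the recomputation

import org.apache.spark.RangePartitioner
val partitioner = new RangePartitioner(8, pairs)
val ranged = pairs.partitionBy(partitioner)

ranged.count()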

PS I use the RDD only once. So, this should not happen.

This is just how Spark works:

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).

So when you don't, it doesn't. If you use an RDD more than once, and have enough memory, you generally want to persist it.
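
For example, a minimal sketch of persisting an RDD that two actions reuse (assuming a SparkContext sc; the data and the computation are placeholders):

// Assumes a running SparkContext `sc`; data and computation are placeholders.
val expensive = sc.parallelize(1 to 100000).map(n => n * n)   // stands in for a long computation

expensive.persist()   // same as .cache(): MEMORY_ONLY storage level

// Both actions reuse the partitions stored by the first one
// instead of re-running the map from scratch.
val total = expensive.reduce(_ + _)
val first = expensive.take(10)

expensive.unpersist()   // release the cached partitions when done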

This can't be done automatically, because Spark can't know whether you are going to reuse the RDD: e.g. you can compute an RDD, then sample it, and use the result to decide whether you want to do something else with it, so whether the RDD is used twice depends on a random number generator.
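
A sketch of that sample-then-decide pattern (assuming sc; the sample fraction, the threshold and the follow-up work are made up for illustration):

// Assumes `sc`; the fraction and threshold are arbitrary illustrations.
val data = sc.parallelize(1 to 100000).map(_.toDouble)

val sampled = data.sample(withReplacement = false, fraction = 0.01)
val sampleMean = sampled.mean()

// Whether `data` is traversed a second time depends on this runtime check,
// which Spark cannot predict when the RDD is defined.
if (sampleMean > 50000.0) {
  data.map(_ * 2).count()   // second use of `data` happens only in this branch
}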

If you don't use RDD.cache, the RDD's computed result is not persisted in memory. For example (given an RDD rdd_test):

import org.apache.spark.rdd.RDD

val rdd_test: RDD[Int] = sc.makeRDD(Array(1,2,3), 1)
val a = rdd_test.map(_+1)   // transformations only: nothing is computed or stored yet
val b = a.map(_+1)

Now none of these three RDDs (rdd_test, a and b) is in memory. So every time an action is run on val c = b.map(_+1), a and b will be recomputed. If we use cache on a and b:

val rdd_test: RDD[Int] = sc.makeRDD(Array(1,2,3), 1)
val a = rdd_test.map(_+1).cache   // mark a to be kept in memory once it is computed
val b = a.map(_+1).cache          // mark b as well

Then when an action is run on val c = b.map(_+1), a and b will not be recomputed; after the first computation their partitions are reused from memory.
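
One way to see this is to run actions on c and look at the lineage; a small sketch (the exact toDebugString output varies between Spark versions):

// Continuing the cached example above.
val c = b.map(_+1)
println(c.collect().mkString(", "))   // first action: computes a and b once and caches them -> 4, 5, 6
println(c.count())                    // second action: reuses the cached partitions of a and b

// After the first action, toDebugString marks the persisted parents as cached.
println(c.toDebugString)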

(Please note: if there is not enough memory, the cache will not be able to hold the data, so a and b will be recomputed.)
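
If memory is tight, a possible alternative (a sketch using MEMORY_AND_DISK instead of the default cache, not part of the example above) is to persist with a storage level that spills to local disk instead of dropping partitions:

import org.apache.spark.storage.StorageLevel

val a = rdd_test.map(_+1).persist(StorageLevel.MEMORY_AND_DISK)
val b = a.map(_+1).persist(StorageLevel.MEMORY_AND_DISK)
// Partitions that do not fit in memory are written to disk,
// so they are re-read later rather than recomputed.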

I'm not good at English, sorry.
