
Spark - StorageLevel (DISK_ONLY vs MEMORY_AND_DISK) and Out of memory Java heap space

Lately I've been running a memory-heavy Spark job and started to wonder about Spark's storage levels. Since one of my RDDs was used twice, I persisted it with StorageLevel.MEMORY_AND_DISK. I was getting OOM "Java heap space" errors during the job. Then, when I removed the persist completely, the job managed to run through and finish.
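For context, here is a minimal sketch of the pattern described above (not the actual job): the input path, the key-by-length transformation, and the two actions are made-up placeholders; the relevant part is just one persist and two uses of the same RDD.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PersistSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical memory-heavy pair RDD standing in for the real one.
        val heavy = sc.textFile("hdfs:///some/large/input")
          .map(line => (line.length % 100, line))

        // The RDD is reused by two actions, so it was persisted once.
        heavy.persist(StorageLevel.MEMORY_AND_DISK)

        val counts = heavy.countByKey()   // first use
        val sample = heavy.take(10)       // second use

        heavy.unpersist()
        spark.stop()
      }
    }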

I always thought that MEMORY_AND_DISK is basically a fully safe option - if you run out of memory, it spills the objects to disk, done. But now it seems that it does not really work the way I expected it to.

This raises two questions:

  1. If MEMORY_AND_DISK spills the objects to disk when the executor runs out of memory, does it ever make sense to use the DISK_ONLY mode (except for some very specific configurations like spark.memory.storageFraction=0; see the config sketch after this list)?
  2. If MEMORY_AND_DISK spills the objects to disk when the executor runs out of memory, how could removing the caching fix the OOM problem? Did I miss something, and was the problem actually elsewhere?
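Regarding the configuration mentioned in question 1, a minimal sketch of how that setting could be applied when the session is built; the application name is a placeholder.

    import org.apache.spark.sql.SparkSession

    // Sketch: give storage no protected share of the unified region, so
    // execution can reclaim all of it; cached blocks may still use the region
    // but are always evictable.
    val spark = SparkSession.builder()
      .appName("storage-fraction-zero")
      .config("spark.memory.storageFraction", "0")
      .getOrCreate()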

So, after a few years ;) this is what I believe happened:

  • Caching is not a way to save execution memory. The best you can do is to avoid losing execution memory when caching (which is what DISK_ONLY does).
  • It's most likely the lack of execution memory that caused my job to throw the OOM error, although I don't remember the actual use case.
  • I used MEMORY_AND_DISK caching, and the MEMORY part took its share from the unified region, which made it impossible for my job to finish (since the execution memory, Execution = Unified - Storage, was not enough to perform the job); see the arithmetic sketch after this list.
  • Because of the above, when I removed the caching altogether the job ran slower, but it had enough execution memory to finish.
  • With DISK_ONLY caching it seems the job would therefore have finished as well (although not necessarily faster).
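To make the Execution = Unified - Storage point concrete, here is an illustrative calculation following the unified memory model from the tuning guide linked below; the 4 GiB executor heap is an assumed example value, and the two fractions are the documented defaults.

    // Rough arithmetic sketch of the unified memory model (example numbers).
    object UnifiedMemorySketch extends App {
      val executorHeapMiB = 4096L   // assumed --executor-memory 4g
      val reservedMiB     = 300L    // reserved memory kept by Spark itself
      val memoryFraction  = 0.6     // spark.memory.fraction (default)
      val storageFraction = 0.5     // spark.memory.storageFraction (default)

      // Unified region shared by execution and storage.
      val unifiedMiB = ((executorHeapMiB - reservedMiB) * memoryFraction).toLong   // ~2277 MiB

      // Part of the unified region from which cached blocks cannot be evicted.
      val protectedStorageMiB = (unifiedMiB * storageFraction).toLong              // ~1138 MiB

      // What execution is guaranteed even when the cache fills its protected share.
      val guaranteedExecutionMiB = unifiedMiB - protectedStorageMiB                // ~1139 MiB

      println(s"unified=$unifiedMiB MiB, protected storage=$protectedStorageMiB MiB, " +
        s"guaranteed execution=$guaranteedExecutionMiB MiB")
    }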

https://spark.apache.org/docs/latest/tuning.html#memory-management-overview

MEMORY_AND_DISK doesn't "spill the objects to disk when the executor goes out of memory". It tells Spark to write partitions that don't fit in memory to disk, so they will be loaded from there when needed.

When dealing with huge datasets you should definitely consider persisting data with DISK_ONLY. https://spark.apache.org/docs/latest/rdd-programming-guide.html#which-storage-level-to-choose
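A rough sketch of that suggestion (the input path and the two actions are placeholders): the cached partitions are written to local disk rather than occupying the unified memory region.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object DiskOnlySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("disk-only-sketch").getOrCreate()
        val huge  = spark.sparkContext.textFile("hdfs:///some/huge/input")

        huge.persist(StorageLevel.DISK_ONLY)

        val total    = huge.count()                       // first pass materializes the cache on disk
        val nonEmpty = huge.filter(_.nonEmpty).count()    // second pass reads the cached partitions back

        huge.unpersist()
        spark.stop()
      }
    }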
