
Garbage Collection of RDDs

I have a fundamental question about Spark. Spark maintains the lineage of RDDs so it can recompute them in case some RDDs get corrupted, so the JVM cannot find them as orphan objects. Then how and when does the garbage collection of RDDs happen?

The memory for RDD storage can be configured using the

"spark.storage.memoryFraction" property.

If this limit is exceeded, older partitions will be dropped from memory.

We can set it to a value between 0 and 1, describing what portion of the executor JVM heap will be dedicated to caching RDDs. The default value is 0.6.

Suppose we have 2 GB of executor memory; then by default about 0.6 * 2 GB is reserved for RDD storage and the remaining 0.4 * 2 GB is left for task execution and other heap usage.
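As a minimal sketch (assuming Spark 1.x with the legacy static memory manager; the application name and sizes are illustrative), these properties can be set on a SparkConf before creating the SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal driver sketch; names and memory sizes are illustrative.
object StorageFractionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StorageFractionDemo")
      .set("spark.executor.memory", "2g")          // total executor heap
      .set("spark.storage.memoryFraction", "0.6")  // portion of the heap reserved for cached RDD partitions

    val sc = new SparkContext(conf)

    // Cached partitions live in the storage region; once it fills up,
    // the least-recently-used partitions are dropped from memory and
    // recomputed from lineage if they are needed again.
    val data = sc.parallelize(1 to 1000000).cache()
    println(data.count())

    sc.stop()
  }
}
```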

We can configure Spark properties to print more details about how GC is behaving:

Set spark.executor.extraJavaOptions to include

"-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
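For example (a sketch; the same string can also be passed with --conf on spark-submit), the option can be set programmatically:

```scala
import org.apache.spark.SparkConf

// Ask each executor JVM to log garbage-collection activity to its stdout
// (visible in the executor logs via the Spark UI).
val gcConf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```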

In case your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of memory, lowering the "spark.storage.memoryFraction" value will help reduce the memory consumption.

For more details, have a look at the reference below:

http://spark.apache.org/docs/1.2.1/tuning.html#garbage-collection-tuning
