
Spark-1.6.0+: spark.shuffle.memoryFraction deprecated - When will spill happen?

In recent versions of Spark, the shuffle behavior has changed a lot.

Question: The Spark UI has stopped showing whether a spill happened or not (and how much). In one of my experiments, I tried to simulate a situation where the shuffle write on an executor would be more than "JVM Heap Size" * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction (based on article), but did not see any relevant disk-spill logs. Is there a way to get this information?
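For concreteness, here is a rough worked example of that pre-1.6 formula. The 8 GB heap is hypothetical; 0.2 and 0.8 were the documented defaults for spark.shuffle.memoryFraction and spark.shuffle.safetyFraction in Spark 1.5 and earlier.

```scala
// Sketch of the old (pre-1.6) shuffle-memory threshold calculation.
val heapBytes      = 8L * 1024 * 1024 * 1024  // hypothetical 8 GB executor heap
val memoryFraction = 0.2                       // spark.shuffle.memoryFraction (old default)
val safetyFraction = 0.8                       // spark.shuffle.safetyFraction (old default)

val spillThreshold = heapBytes * memoryFraction * safetyFraction
// ≈ 1.28 GB: shuffle data beyond this was expected to spill to disk pre-1.6.
```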

PS: Please excuse me if this sounds like a theoretical question.

With Spark 1.6.0, the memory management system was updated. In short, there is no longer dedicated cache/shuffle memory; all memory can be used for either operation. From the release notes:

Automatic memory management: Another area of performance gains in Spark 1.6 comes from better memory management. Before Spark 1.6, Spark statically divided the available memory into two regions: execution memory and cache memory. Execution memory is the region that is used in sorting, hashing, and shuffling, while cache memory is used to cache hot data. Spark 1.6 introduces a new memory manager that automatically tunes the size of different memory regions. The runtime automatically grows and shrinks regions according to the needs of the executing application. For many applications, this will mean a significant increase in available memory that can be used for operators such as joins and aggregations, without any user tuning.
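In practice, the old spark.shuffle.memoryFraction / spark.shuffle.safetyFraction knobs are superseded by the unified-memory settings spark.memory.fraction and spark.memory.storageFraction. A minimal configuration sketch follows; the values shown are the documented defaults for recent releases (spark.memory.fraction was 0.75 in 1.6 and became 0.6 in 2.0), so check the docs for your version.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("unified-memory-example")
  // Fraction of (heap - 300 MB reserved) shared by execution AND storage.
  .set("spark.memory.fraction", "0.6")
  // Share of that unified region protected for cached blocks; execution can
  // still borrow unused storage memory, and vice versa.
  .set("spark.memory.storageFraction", "0.5")

val sc = new SparkContext(conf)
```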

This JIRA ticket gives the background reasoning for the change, and this paper discusses the new memory management system in depth.
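As for getting the spill information programmatically: one option (a sketch, not the only way) is to register a SparkListener and read the per-task memoryBytesSpilled / diskBytesSpilled counters, which are the same metrics that back the spill columns the UI only shows when a spill actually occurred. The SpillListener class name below is my own; the listener API and metric fields are standard Spark.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs per-task spill counters as tasks finish.
class SpillListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
      println(
        s"Task ${taskEnd.taskInfo.taskId} spilled: " +
        s"${m.memoryBytesSpilled} bytes (in-memory), " +
        s"${m.diskBytesSpilled} bytes (on disk)")
    }
  }
}

// Registration, assuming an existing SparkContext `sc`:
// sc.addSparkListener(new SpillListener)
```

The executor logs also record lines about spilling in-memory data to disk when it happens, so grepping executor stderr is another way to confirm whether a spill occurred.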
