
Spark-1.6.0+: spark.shuffle.memoryFraction deprecated - When will spill happen?

In recent versions of Spark, the shuffle behavior has changed a lot.

Question: The Spark UI has stopped showing whether a spill happened or not (and how much). In one of my experiments, I tried to simulate a situation where the shuffle write on an executor would be more than "JVM Heap Size" * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction (based on article), but did not see any relevant disk-spill logs. Is there a way to get this information?
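For concreteness, here is a rough worked example of that pre-1.6 formula. The 8 GB heap is hypothetical; 0.2 and 0.8 were the documented defaults for spark.shuffle.memoryFraction and spark.shuffle.safetyFraction in Spark 1.5 and earlier.

```scala
// Sketch of the old (pre-1.6) shuffle-memory threshold calculation.
val heapBytes      = 8L * 1024 * 1024 * 1024  // hypothetical 8 GB executor heap
val memoryFraction = 0.2                       // spark.shuffle.memoryFraction (old default)
val safetyFraction = 0.8                       // spark.shuffle.safetyFraction (old default)

val spillThreshold = heapBytes * memoryFraction * safetyFraction
// ≈ 1.28 GB: shuffle data beyond this was expected to spill to disk pre-1.6.
```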

PS: Please excuse me if this sounds like a theoretical question.

With Spark 1.6.0, the memory management system was updated. In short, there is no longer dedicated cache/shuffle memory; all memory can be used for either operation. From the release notes:

Automatic memory management: Another area of performance gains in Spark 1.6 comes from better memory management. Before Spark 1.6, Spark statically divided the available memory into two regions: execution memory and cache memory. Execution memory is the region that is used in sorting, hashing, and shuffling, while cache memory is used to cache hot data. Spark 1.6 introduces a new memory manager that automatically tunes the size of different memory regions. The runtime automatically grows and shrinks regions according to the needs of the executing application. For many applications, this will mean a significant increase in available memory that can be used for operators such as joins and aggregations, without any user tuning.
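In practice, the old spark.shuffle.memoryFraction / spark.shuffle.safetyFraction knobs are superseded by the unified-memory settings spark.memory.fraction and spark.memory.storageFraction. A minimal configuration sketch follows; the values shown are the documented defaults for recent releases (spark.memory.fraction was 0.75 in 1.6 and became 0.6 in 2.0), so check the docs for your version.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("unified-memory-example")
  // Fraction of (heap - 300 MB reserved) shared by execution AND storage.
  .set("spark.memory.fraction", "0.6")
  // Share of that unified region protected for cached blocks; execution can
  // still borrow unused storage memory, and vice versa.
  .set("spark.memory.storageFraction", "0.5")

val sc = new SparkContext(conf)
```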

This JIRA ticket gives the background reasoning for the change, and this paper discusses the new memory management system in depth.
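As for getting the spill information programmatically: one option (a sketch, not the only way) is to register a SparkListener and read the per-task memoryBytesSpilled / diskBytesSpilled counters, which are the same metrics that back the spill columns the UI only shows when a spill actually occurred. The SpillListener class name below is my own; the listener API and metric fields are standard Spark.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs per-task spill counters as tasks finish.
class SpillListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
      println(
        s"Task ${taskEnd.taskInfo.taskId} spilled: " +
        s"${m.memoryBytesSpilled} bytes (in-memory), " +
        s"${m.diskBytesSpilled} bytes (on disk)")
    }
  }
}

// Registration, assuming an existing SparkContext `sc`:
// sc.addSparkListener(new SpillListener)
```

The executor logs also record lines about spilling in-memory data to disk when it happens, so grepping executor stderr is another way to confirm whether a spill occurred.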
