

Spark job throwing "java.lang.OutOfMemoryError: GC overhead limit exceeded"

I have a Spark job that throws "java.lang.OutOfMemoryError: GC overhead limit exceeded".

The job is trying to process a file of about 4.5 GB.

I've tried the following Spark configuration:

--num-executors 6  --executor-memory 6G --executor-cores 6 --driver-memory 3G 

I tried increasing the number of cores and executors, which sometimes works, but the job then takes over 20 minutes to process the file.

Could I do something to improve the performance, or to stop the Java heap issue?

The only solution is to fine-tune the configuration.

In my experience, the following points help with OOM:

  • Cache an RDD only if you are going to use it more than once (see the sketch after this list).

If you still need to cache, analyze the data and the application with respect to the available resources.

  • If your cluster has enough memory, increase spark.executor.memory toward its maximum.
  • Increase the number of partitions to increase parallelism.
  • Increase the memory dedicated to caching via spark.storage.memoryFraction. If a lot of shuffle memory is involved, try to avoid heavy caching or split the allocation carefully.
  • Spark's Persist(MEMORY_AND_DISK) storage level is available at the cost of additional processing (serializing, writing, and reading back the data); CPU usage is usually high in this case.
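
As a minimal sketch of the caching and partitioning points above (Scala, RDD API): the input path, partition count, and memory values are illustrative assumptions, not taken from the question, and need tuning against your cluster.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object GcOverheadDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gc-overhead-demo")
      // Illustrative value only; tune to your cluster.
      .config("spark.executor.memory", "6g")
      // Note: spark.storage.memoryFraction only applies to the legacy (pre-1.6) memory manager.
      .config("spark.storage.memoryFraction", "0.4")
      .getOrCreate()

    // Hypothetical input path standing in for the 4.5 GB file.
    val lines = spark.sparkContext.textFile("hdfs:///data/big-input.txt")

    // More partitions -> smaller tasks -> less pressure on any single executor heap.
    val repartitioned = lines.repartition(200)

    // Persist only because the RDD is reused twice below; MEMORY_AND_DISK spills
    // blocks to disk instead of failing when memory runs short, at extra CPU cost.
    val words = repartitioned
      .flatMap(_.split("\\s+"))
      .persist(StorageLevel.MEMORY_AND_DISK)

    val total = words.count()               // first use
    val distinct = words.distinct().count() // second use justifies the persist
    println(s"total=$total distinct=$distinct")

    words.unpersist() // free cached blocks as soon as they are no longer needed
    spark.stop()
  }
}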
  1. You can try increasing driver-memory. If you don't have enough memory overall, you may be able to take some from executor-memory.

  2. Check the Spark UI (port 4040 on the driver) to see the scheduler delay. If the scheduler delay is high, quite often the driver is shipping a large amount of data to the executors, which needs to be fixed. A rebalanced submit command is sketched below.
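
For example, a rebalanced submit command could look like the following; the exact numbers, class name, and jar name are assumptions and should be validated against the Spark UI for your job:

spark-submit \
  --num-executors 6 \
  --executor-cores 4 \
  --executor-memory 5G \
  --driver-memory 4G \
  --conf spark.default.parallelism=200 \
  --class com.example.GcOverheadDemo \
  gc-overhead-demo.jar

While the job runs, the Spark UI is available at http://<driver-host>:4040, where the scheduler delay and per-stage memory/spill metrics can be checked.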
