
Spark worker with 32 GB or more memory encountered a fatal error

I have three slaves in a standalone Spark cluster, each with 48 GB of RAM. The problem appears when I assign more than 31 GB (e.g. 32 GB or more) of RAM to the executors:

.config("spark.executor.memory", "44g")

During a join of two large DataFrames, the executors are terminated without much information. The driver's output shows "missing an output location for shuffle":

17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM
17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5)
17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested
17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 !
17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 !
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 !
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None)
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor

The Spark master's log messages show that the executor "EXITED" and was then relaunched:

17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED
17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705

The Spark worker's log messages show that the executor exited with code 134:

17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134

The only clue seems to be in the application's error log, which shows that a fatal error was detected by the JRE:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x3ffa73]  CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x0000000001c9e800):  GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308]

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008

My program works fine as long as I assign 31 GB of RAM (or less) to each executor. Has anyone run into a problem like this before?

Because of the way Java stores object references, a 44 GB heap can actually give you less usable space than a 31 GB one: for heap sizes above 32 GB the JVM has to switch to 64-bit object references, which means every object takes up more space. More details here: http://java-performance.info/over-32g-heap-java/
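One way to see whether a given heap size still fits within compressed 32-bit references is to ask the running JVM directly. A minimal, HotSpot-specific sketch (not part of the original answer); run it with -Xmx31g and then with -Xmx33g to watch the flag flip:

import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Reports whether compressed object pointers ("compressed oops") are enabled
// in the current JVM. Above roughly 32 GB of heap, HotSpot disables them and
// every object reference grows from 4 to 8 bytes.
object CheckCompressedOops {
  def main(args: Array[String]): Unit = {
    val diag = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    val oops = diag.getVMOption("UseCompressedOops")
    println(s"UseCompressedOops = ${oops.getValue}")
  }
}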

My rule of thumb is to stay either below 32 GB or well above it (say, 50 GB). Usually it is more cost-effective to run multiple JVMs, each with a heap smaller than 32 GB. With 48 GB of RAM per machine I would stick to a 31 GB heap.
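As an illustration of the "several smaller JVMs" idea on a hypothetical bigger node (the 64 GB / 16-core numbers below are assumptions, not from the answer), Spark standalone can run more than one executor per worker when spark.executor.cores is set:

import org.apache.spark.sql.SparkSession

// Hypothetical sizing for a 64 GB, 16-core worker: two executors per worker,
// each kept under the 32 GB threshold so compressed oops stay enabled.
val spark = SparkSession.builder()
  .appName("two-executors-per-worker")     // placeholder name
  .master("spark://master-host:7077")      // placeholder master URL
  .config("spark.executor.memory", "28g")  // < 32 GB heap per JVM
  .config("spark.executor.cores", "8")     // 16 worker cores / 8 => 2 executors
  .getOrCreate()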
