
Spark keeps submitting tasks to dead executor

I am working on Apache Spark and I am facing a very weird issue. One of the executors fails because of OOM, and its shutdown hooks clear all of its storage (memory and disk), but the driver apparently keeps submitting the failed tasks to the same executor because the tasks are PROCESS_LOCAL.

Now that the storage on that machine is cleared, all the retried tasks also fail, causing the whole stage to fail (after 4 retries).

What I don't understand is how the driver could not know that the executor is shutting down and will not be able to execute any tasks.

Configuration (a sketch of setting these follows the list):

  • Heartbeat interval: 60s

  • Network timeout: 600s
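For reference, a minimal sketch of setting these two values explicitly on a SparkConf, assuming Scala; the property names spark.executor.heartbeatInterval and spark.network.timeout are the standard Spark keys for the values above, and the app name is a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch only: the same heartbeat/timeout values as in the list above, set explicitly.
val conf = new SparkConf()
  .setAppName("oom-debug-job")                      // placeholder app name
  .set("spark.executor.heartbeatInterval", "60s")   // heartbeat interval: 60s
  .set("spark.network.timeout", "600s")             // network timeout: 600s (must exceed the heartbeat interval)

val spark = SparkSession.builder().config(conf).getOrCreate()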

Logs confirming that the executor is accepting tasks after shutdown:

20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] Executor: Exception in task 6391.0 in stage 17.0 (TID 138513)
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 138513,5,main]
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 INFO [pool-8-thread-1] DiskBlockManager: Shutdown hook called
20/09/29 20:26:35 ERROR [Executor task launch worker for task 138295] Executor: Exception in task 6239.0 in stage 17.0 (TID 138295)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/35/temp_shuffle_b5df90ac-78de-48e3-9c2d-891f8b2ce1fa (No such file or directory)
20/09/29 20:26:36 ERROR [Executor task launch worker for task 139484] Executor: Exception in task 6587.0 in stage 17.0 (TID 139484)
org.apache.spark.SparkException: Block rdd_3861_6587 was not found even though it's read-locked
20/09/29 20:26:42 WARN [Thread-2] ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_172]
20/09/29 20:26:44 ERROR [Executor task launch worker for task 140256] Executor: Exception in task 6576.3 in stage 17.0 (TID 140256)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/30/rdd_3861_6576 (No such file or directory)
20/09/29 20:26:44 INFO [dispatcher-event-loop-0] Executor: Executor is trying to kill task 6866.1 in stage 17.0 (TID 140329), reason: stage cancelled
20/09/29 20:26:47 INFO [pool-8-thread-1] ShutdownHookManager: Shutdown hook called
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: removing client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: Stopping client
20/09/29 20:26:47 DEBUG [Thread-2] ShutdownHookManager: ShutdownHookManger complete shutdown
20/09/29 20:26:55 INFO [dispatcher-event-loop-14] CoarseGrainedExecutorBackend: Got assigned task 141510
20/09/29 20:26:55 INFO [Executor task launch worker for task 141510] Executor: Running task 545.1 in stage 26.0 (TID 141510)

(I have trimmed down the stack traces, as they are just Spark RDD shuffle-read methods.)

If we check the timestamps, the shutdown started at 20/09/29 20:26:32 and ended at 20/09/29 20:26:47. In between, the driver sent all the retried tasks to the same executor and they all failed, causing the stage cancellation.

Can someone help me understand this behavior? Also, let me know if anything else is required.

There is a configuration in Spark, spark.task.maxFailures, whose default is 4, so Spark retries a task when it fails. The TaskRunner updates the task status to the driver, and the driver forwards that status to TaskSchedulerImpl. In your case, only the executor hit OOM and its DiskBlockManager shut down, but the driver is alive. I think the executor is alive as well, and the task is retried by the same TaskSetManager. Once the task has failed 4 times, the stage is cancelled and the executor is killed.
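For illustration only, a sketch (Scala, placeholder values) of the knobs involved here: spark.task.maxFailures controls how many attempts a task gets, and the blacklist settings (renamed to exclusion in Spark 3.1+) are an additional option, beyond what is described above, to keep retries from landing on the same failing executor:

import org.apache.spark.SparkConf

// Sketch only: the property names are real Spark settings; the values are placeholders, not recommendations.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")                           // default: the stage fails after 4 attempts of one task
  .set("spark.blacklist.enabled", "true")                       // Spark 2.x/3.0 name; spark.excludeOnFailure.enabled in 3.1+
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")  // retry a failed task on a different executor

Whether the blacklist helps in this exact case depends on the Spark version in use.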
