
Spark keeps submitting tasks to dead executor

I am working with Apache Spark and I am facing a very strange issue. One of the executors failed due to an OOM, and its shutdown hook cleared all of its storage (memory and disk), yet the driver kept submitting the failed tasks to that same executor, apparently because the tasks were PROCESS_LOCAL.

Since the storage on that machine had been cleared, all the retried tasks also failed, which caused the whole stage to fail (after 4 retries).

What I don't understand is how the driver could be unaware that the executor was shutting down and unable to execute any tasks.

Configuration (a sketch of how these could be set follows the list):

  • Heartbeat interval: 60s

  • Network timeout: 600s
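
A minimal sketch in Scala of how these two settings could be applied through SparkConf; the app name is a placeholder, and the spark.locality.wait line (shown at its default of 3s) is included only to illustrate the locality preference mentioned above, not taken from the question:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder app name; the two timeout values mirror the question's configuration.
val conf = new SparkConf()
  .setAppName("spark-oom-repro")
  .set("spark.executor.heartbeatInterval", "60s") // heartbeat interval: 60s
  .set("spark.network.timeout", "600s")           // network timeout: 600s
  .set("spark.locality.wait", "3s")               // default; how long the scheduler waits
                                                  // before giving up PROCESS_LOCAL placement

val spark = SparkSession.builder().config(conf).getOrCreate()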

Logs confirming that the executor was still accepting tasks after shutdown:

20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] Executor: Exception in task 6391.0 in stage 17.0 (TID 138513)
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 ERROR [Executor task launch worker for task 138513] SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 138513,5,main]
java.lang.OutOfMemoryError: Java heap space
20/09/29 20:26:32 INFO [pool-8-thread-1] DiskBlockManager: Shutdown hook called
20/09/29 20:26:35 ERROR [Executor task launch worker for task 138295] Executor: Exception in task 6239.0 in stage 17.0 (TID 138295)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/35/temp_shuffle_b5df90ac-78de-48e3-9c2d-891f8b2ce1fa (No such file or directory)
20/09/29 20:26:36 ERROR [Executor task launch worker for task 139484] Executor: Exception in task 6587.0 in stage 17.0 (TID 139484)
org.apache.spark.SparkException: Block rdd_3861_6587 was not found even though it's read-locked
20/09/29 20:26:42 WARN [Thread-2] ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_172]
20/09/29 20:26:44 ERROR [Executor task launch worker for task 140256] Executor: Exception in task 6576.3 in stage 17.0 (TID 140256)
java.io.FileNotFoundException: /storage/1/spark/spark-ba168da6-dc11-4e15-bd95-1e58198c81e7/executor-8dea198c-741a-4733-8fbb-df57241acdd5/blockmgr-1fc6b30a-c24e-4bb2-a133-5e411cef810f/30/rdd_3861_6576 (No such file or directory)
20/09/29 20:26:44 INFO [dispatcher-event-loop-0] Executor: Executor is trying to kill task 6866.1 in stage 17.0 (TID 140329), reason: stage cancelled
20/09/29 20:26:47 INFO [pool-8-thread-1] ShutdownHookManager: Shutdown hook called
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: removing client from cache: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@3117bde5
20/09/29 20:26:47 DEBUG [pool-8-thread-1] Client: Stopping client
20/09/29 20:26:47 DEBUG [Thread-2] ShutdownHookManager: ShutdownHookManger complete shutdown
20/09/29 20:26:55 INFO [dispatcher-event-loop-14] CoarseGrainedExecutorBackend: Got assigned task 141510
20/09/29 20:26:55 INFO [Executor task launch worker for task 141510] Executor: Running task 545.1 in stage 26.0 (TID 141510)

(I have trimmed the stack traces, since they are just Spark RDD shuffle read methods)

If we check the timestamps, the shutdown started at 20/09/29 20:26:32 and finished at 20/09/29 20:26:47. During that window, the driver sent all the retried tasks to the same executor, and they all failed, causing the stage to be cancelled.

Can someone help me understand this behavior? Please let me know if anything else is needed.

There is a configuration in Spark, spark.task.maxFailures, which defaults to 4, so Spark retries a task when it fails. The TaskRunner updates the task status to the driver, and the driver forwards that status to TaskSchedulerImpl. In your case, only the executor hit the OOM and its DiskBlockManager shut down; the driver was still alive. I think the executor process was still alive as well, so the tasks were retried through the same TaskSetManager. Once a task's failure count reaches 4, the stage is cancelled and the executor is killed.
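
For illustration only (not a fix), a minimal sketch of where that retry limit lives; spark.task.maxFailures is the standard setting the answer refers to, 4 is its default, and the app name is a placeholder:

import org.apache.spark.sql.SparkSession

// spark.task.maxFailures is the per-task retry limit; 4 is the default,
// matching the "failed after 4 retries" behavior described in the question.
val spark = SparkSession.builder()
  .appName("task-retry-demo")            // placeholder name
  .config("spark.task.maxFailures", "4") // default, shown explicitly
  .getOrCreate()

Raising this limit would only delay the stage failure here, since the retries keep landing on the same dying executor.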
