
Can Spark with External Shuffle Service use saved shuffle files in the event of executor failure?

The Spark docs state:

a Spark executor exits either on failure or when the associated application has also exited. In both scenarios, all state associated with the executor is no longer needed and can be safely discarded.

However, in the scenario where the Spark cluster configuration and dataset are such that an occasional executor failure (e.g. from OOM) occurs deep into a job, it is far preferable for the shuffle files written by the dead executor to remain available to the job rather than have them recomputed.

In such a scenario with the External Shuffle Service enabled, I appear to have observed Spark continuing to fetch the aforementioned shuffle files and only rerunning the tasks that were active at the time the executor died. In contrast, with the External Shuffle Service disabled, I have seen Spark rerun a proportion of previously completed stages to recompute the lost shuffle files, as expected.

So can Spark with the External Shuffle Service enabled use saved shuffle files in the event of executor failure, as I appear to have observed? I think so, but the documentation makes me doubt it.

I am running Spark 3.0.1 with Yarn on EMR 6.2 with dynamic allocation disabled.
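For reference, a minimal sketch of how that setup could be expressed in application code (the app name is made up; the same properties can equally go in spark-defaults.conf or be passed via --conf to spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the setup described above: external shuffle service enabled,
// dynamic allocation disabled. The property names are standard Spark
// settings; the application name is hypothetical.
val spark = SparkSession.builder()
  .appName("shuffle-reuse-example")
  .config("spark.shuffle.service.enabled", "true")    // fetch shuffle blocks via the NodeManager's shuffle service
  .config("spark.dynamicAllocation.enabled", "false") // dynamic allocation off, as in this job
  .getOrCreate()
```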

Also, to pre-empt comments: of course it is preferable to configure the cluster so that executor OOM never occurs. However, when first trying to complete an expensive Spark job, the optimal cluster configuration has not yet been found. It is at this time that shuffle reuse in the face of executor failure is valuable.

The sentence you quoted:

a Spark executor exits either on failure or when the associated application has also exited. In both scenarios, all state associated with the executor is no longer needed and can be safely discarded.

is from the "Graceful Decommission of Executors" section.

That feature's main intention is to provide a solution when Kubernetes is used as the resource manager, where an external shuffle service is not available. It migrates the disk-persisted RDD blocks and shuffle blocks to the remaining executors.

In the case of Yarn, when the external shuffle service is enabled the blocks will be fetched from the external shuffle service, which runs as an auxiliary service of Yarn (within the node manager). That service knows the executor's internal directory structure and is able to serve the blocks (as it is on the same host). This way, when the node survives and only the executor dies, the blocks will not be lost.
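For context, the registration of that auxiliary service in yarn-site.xml looks roughly like the following (this follows the pattern described in Spark's Running-on-YARN docs; EMR normally ships this configuration already, so it is shown only to make clear where the service lives):

```xml
<!-- yarn-site.xml on each NodeManager: register Spark's external shuffle
     service as a YARN auxiliary service so it can keep serving shuffle
     blocks after an executor on that node dies. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```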

Dynamic resource allocation requires ESS to be enabled in Yarn. See this in the same doc.

ESS is an independent service running in the node manager. When an executor dies, the ESS on the same node is able to serve its map output files when remote executors try to fetch them.
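As a sketch of that pairing (the property names are standard Spark settings; this is only an illustration, not your configuration, since you have dynamic allocation disabled):

```scala
import org.apache.spark.SparkConf

// Illustrative pairing: with dynamic allocation on YARN, the external
// shuffle service must also be enabled so that shuffle files written by
// released executors remain available to the job.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
```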

I think the Spark Blacklist Design Doc answers the question (it is referenced in the HealthTracker class comments):

Spark already does some handling of failed nodes when it fetches a shuffle block. This automatically leads to spark marking the shuffle data as missing, and using the stage failure mechanism to regenerate the missing blocks based on lineage.

i.e. it is fetch failure that triggers stage failure and the subsequent shuffle block regeneration. When the Shuffle Service is enabled and serving the fetch requests, fetch failure, and hence the subsequent stage failure, will not occur on executor death. Furthermore, there is nothing in either the blacklist design or the driver's HeartBeatReceiver that explicitly removes shuffle blocks written by a lost executor. Hence, shuffle blocks written by a dead executor will continue to be served by the shuffle service on the associated node.
