Tuning Spark: number of executors per node when cores available are uneven

I have read that having 5 cores per executor in Spark achieves the optimal read/write throughput, so setting spark.executor.cores = 5 is usually desired. I have also read that you should subtract one core per node to allow the underlying daemon processes to run.

So, determining the number of executors per node follows this formula:

executors per node = (cores per node - 1) / 5 cores per executor

However, what is the best approach in a scenario where you have 8 cores in each node machine?

1.4 executors per node = (8 - 1) / 5
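To make the arithmetic concrete, here is a minimal Scala sketch of that calculation (the variable names are just illustrative):

val coresPerNode = 8
val coresForDaemons = 1
val coresPerExecutor = 5
val executorsPerNode = (coresPerNode - coresForDaemons).toDouble / coresPerExecutor
println(executorsPerNode)   // prints 1.4 - not a whole number, so it has to be rounded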

First question - will Spark/YARN have an executor spanning multiple nodes?

If not, then I need to round. Which way should I go? It seems my options are:

1.) round down to 1 - meaning I'd only have 1 executor per node. I could increase the cores per executor, though I don't know if I would get any benefit from that.

2.) round up to 2 - that means I'd have to decrease the cores per executor to 3 (8 cores available, minus 1 for the daemons, and you can't have half a core), which could decrease their efficiency.
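For concreteness, a hypothetical SparkSession configuration for option 2 might look like the sketch below. The application name, instance count, and memory figure are made-up illustrative values, not recommendations; option 1 would simply raise spark.executor.cores and drop to one executor per node.

import org.apache.spark.sql.SparkSession

// Option 2 sizing: 2 executors per node with 3 cores each (8 cores - 1 for daemons = 7 usable).
// Option 1 would instead use a single executor per node with spark.executor.cores = 5 (or up to 7).
val spark = SparkSession.builder()
  .appName("executor-sizing-sketch")          // hypothetical application name
  .config("spark.executor.cores", "3")        // cores per executor
  .config("spark.executor.instances", "20")   // total executors across the cluster, e.g. roughly 2 per node x 10 nodes
  .config("spark.executor.memory", "9g")      // illustrative only; derive from node memory / executors per node
  .getOrCreate()

The same settings can also be passed as --conf options to spark-submit instead of being set in code.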

Here spark.executor.cores = 5 is not a hard-and-fast value. The rule of thumb is to use 5 cores or fewer per executor.

We need 1 core per node for the OS and other Hadoop daemons, which leaves us 7 cores per node. Also remember that, out of all the executors, we need to set 1 executor aside for YARN (the Application Master).

When spark.executor.cores = 4, each node fits only 1 executor (7 / 4 rounds down to 1), so setting one executor aside for YARN would leave a node with no executors at all; I would suggest not using this value.

When spark.executor.cores = 3 or spark.executor.cores = 2, even after setting one executor aside for YARN we are still left with at least 1 executor per node (7 / 3 = 2 and 7 / 2 = 3 executors per node, respectively).
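A small Scala sketch of that per-node arithmetic, assuming 8-core nodes with 1 core reserved for daemons:

// Floor division gives the number of executors that fit on one node.
val usableCoresPerNode = 8 - 1
for (coresPerExecutor <- Seq(2, 3, 4, 5)) {
  val executorsPerNode = usableCoresPerNode / coresPerExecutor   // integer division = floor
  println(s"spark.executor.cores = $coresPerExecutor -> $executorsPerNode executor(s) per node")
}
// Output: 2 -> 3, 3 -> 2, 4 -> 1, 5 -> 1; one executor cluster-wide then goes to the YARN Application Master.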

Now, which of these is more efficient for your code? That cannot be answered in general; it depends on multiple other factors such as the amount of data processed, the number of joins involved, and so on.

This is based on my understanding. It provides a starting point for exploring the other options.

NOTE: If you are using some external Java libraries and Datasets in your code, you might need 1 core per executor to preserve type safety.

Hope it helps...
