Tuning Spark: number of executors per node when cores available are uneven
I have read that having 5 cores per executor in Spark achieves the optimal read/write throughput, so setting spark.executor.cores = 5 is usually desired. And also that you should subtract one core per node to allow for the underlying daemon processes to run.
So, determining the number of executors per node follows this formula:
executors per node = (cores per node - 1) / 5 cores per executor
However, what is the best approach in a scenario where you have 8 cores in each node machine?
1.4 executors per node = (8 - 1) / 5
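The arithmetic above can be sketched as follows (executors_per_node is a hypothetical helper name, not a Spark API):

```python
def executors_per_node(cores_per_node, cores_per_executor=5, reserved=1):
    # Reserve cores for the OS/daemon processes, then divide what is
    # left among executors of the recommended size.
    return (cores_per_node - reserved) / cores_per_executor

print(executors_per_node(8))   # 1.4 -- not a whole number, hence the question
print(executors_per_node(16))  # 3.0 -- divides evenly
```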
First question - will Spark/YARN have an executor spanning multiple nodes?
If not, then I need to round. Which way should I go? It seems my options are:
1.) round down to 1 - meaning I'd only have 1 executor per node. I could increase the cores per executor, though I don't know if I would get any benefit from that.
2.) round up to 2 - that means I'd have to decrease the cores per executor to 3 (8 cores available, minus 1 for the daemons, and you can't have half a core), which could decrease their efficiency.
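For what it's worth, option 2 would translate into submit-time settings along these lines. This is only a sketch: the 4-node cluster size and the app.py script name are assumptions for illustration, and spark.executor.instances only takes effect when dynamic allocation is disabled:

```python
# Option 2 on a hypothetical 4-node cluster: 2 executors per node,
# 3 cores each, minus one executor slot left for the YARN
# ApplicationMaster.
nodes = 4  # assumption for illustration
executors_per_node = 2
total_executors = nodes * executors_per_node - 1  # leave room for the AM

submit_args = [
    "spark-submit",
    "--conf", "spark.executor.cores=3",
    "--conf", f"spark.executor.instances={total_executors}",
    "app.py",
]
print(" ".join(submit_args))
```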
Here spark.executor.cores = 5 is not a hard-and-fast value. The rule of thumb is a number of cores equal to or less than 5.
We need 1 core for the OS and other Hadoop daemons, which leaves us with 7 cores per node. Remember that out of all the executors, we need to set aside 1 executor for YARN.
When spark.executor.cores = 4, we cannot leave 1 executor for YARN, so I suggest not taking up this value.
When spark.executor.cores = 3 or spark.executor.cores = 2, after leaving one executor for YARN we will always be left with at least 1 executor per node.
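The walkthrough above can be checked with a quick enumeration (integer division, since a node cannot run a fractional executor):

```python
usable_cores = 8 - 1  # 1 core reserved for the OS and Hadoop daemons

for cores_per_executor in (5, 4, 3, 2):
    executors = usable_cores // cores_per_executor  # whole executors per node
    idle = usable_cores % cores_per_executor        # cores left unused
    print(f"cores/executor={cores_per_executor}: "
          f"{executors} executor(s) per node, {idle} core(s) idle")
```

With 3 or 2 cores per executor you get 2 or 3 executors per node, which is what allows setting one executor aside for YARN while still keeping at least one per node for the job.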
Now, which one is efficient for your code? Well, that cannot be determined up front; it depends on multiple other factors like the amount of data used, the number of joins, etc.
This is based on my understanding. It provides a starting point from which to explore multiple other options.
NOTE: If you are using some external Java libraries and Datasets in your code, you might need to have 1 core per executor to preserve type safety.
Hope it helps...