Tuning Spark: number of executors per node when cores available are uneven

I have read that having 5 cores per executor in Spark achieves the optimal read/write throughput, so setting spark.executor.cores = 5 is usually desired. I have also read that you should subtract one core per node to allow the underlying daemon processes to run.

So, determining the number of executors per node follows this formula:

executors per node = (cores per node - 1) / 5 cores per executor

However, what is the best approach in a scenario where you have 8 cores in each node machine?

1.4 executors per node = (8 - 1) / 5
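To make the arithmetic concrete, here is a minimal Scala sketch of that calculation (the variable names are just illustrative):

val coresPerNode = 8
val coresForDaemons = 1
val coresPerExecutor = 5
val executorsPerNode = (coresPerNode - coresForDaemons).toDouble / coresPerExecutor
println(executorsPerNode)   // prints 1.4 - not a whole number, so it has to be rounded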

First question - will Spark/YARN have an executor spanning multiple nodes?

If not, then I need to round. Which way should I go? It seems my options are:

1.) round down to 1 - meaning I'd only have 1 executor per node. I could increase the cores per executor, though I don't know if I would get any benefit from that.

2.) round up to 2 - that means I'd have to decrease the cores per executor to 3 (8 cores available, minus 1 for the daemons, and you can't have half a core), which could decrease their efficiency.
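For concreteness, a hypothetical SparkSession configuration for option 2 might look like the sketch below. The application name, instance count, and memory figure are made-up illustrative values, not recommendations; option 1 would simply raise spark.executor.cores and drop to one executor per node.

import org.apache.spark.sql.SparkSession

// Option 2 sizing: 2 executors per node with 3 cores each (8 cores - 1 for daemons = 7 usable).
// Option 1 would instead use a single executor per node with spark.executor.cores = 5 (or up to 7).
val spark = SparkSession.builder()
  .appName("executor-sizing-sketch")          // hypothetical application name
  .config("spark.executor.cores", "3")        // cores per executor
  .config("spark.executor.instances", "20")   // total executors across the cluster, e.g. roughly 2 per node x 10 nodes
  .config("spark.executor.memory", "9g")      // illustrative only; derive from node memory / executors per node
  .getOrCreate()

The same settings can also be passed as --conf options to spark-submit instead of being set in code.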

Here spark.executor.cores = 5 is not a hard-and-fast value. The rule of thumb is to use 5 cores or fewer per executor.

We need 1 core per node for the OS and other Hadoop daemons, which leaves us 7 cores per node. Also remember that, out of all the executors, we need to set 1 executor aside for YARN (the Application Master).

When spark.executor.cores = 4, each node fits only 1 executor (7 / 4 rounds down to 1), so setting one executor aside for YARN would leave a node with no executors at all; I would suggest not using this value.

When spark.executor.cores = 3 or spark.executor.cores = 2, even after setting one executor aside for YARN we are still left with at least 1 executor per node (7 / 3 = 2 and 7 / 2 = 3 executors per node, respectively).
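A small Scala sketch of that per-node arithmetic, assuming 8-core nodes with 1 core reserved for daemons:

// Floor division gives the number of executors that fit on one node.
val usableCoresPerNode = 8 - 1
for (coresPerExecutor <- Seq(2, 3, 4, 5)) {
  val executorsPerNode = usableCoresPerNode / coresPerExecutor   // integer division = floor
  println(s"spark.executor.cores = $coresPerExecutor -> $executorsPerNode executor(s) per node")
}
// Output: 2 -> 3, 3 -> 2, 4 -> 1, 5 -> 1; one executor cluster-wide then goes to the YARN Application Master.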

Now, which of these is more efficient for your code? That cannot be answered in general; it depends on multiple other factors such as the amount of data processed, the number of joins involved, and so on.

This is based on my understanding. It provides a starting point for exploring the other options.

NOTE: If you are using some external Java libraries and Datasets in your code, you might need 1 core per executor to preserve type safety.

Hope it helps...
