
Spark UI Executor

In the Spark UI, 18 executors were added and 6 executors were removed. When I checked the Executors tab, I saw many dead and excluded executors. Currently, dynamic allocation is used in EMR.

I've looked up some posts about dead executors, but they are mostly related to job failures. In my case, the job itself does not fail, yet I can still see dead and excluded executors.

What are these "dead" and "excluded" executors? How do they affect the performance of the current Spark cluster configuration? If they do affect performance, what would be a good way to improve it?

With dynamic allocation enabled, Spark tries to adjust the number of executors to the number of tasks in the active stages. Let's take a look at this example:

  1. The job starts, and the first stage reads from a huge source, which takes some time. Let's say this source is partitioned and Spark generates 100 tasks to fetch the data. If your executors have 5 cores each, Spark will spawn 20 executors to get the best parallelism (20 executors x 5 cores = 100 tasks in parallel).

  2. Let's say that in the next step you do a repartition or a sort merge join, with shuffle partitions set to 200, so Spark generates 200 tasks. Spark is smart enough to figure out that it currently has only 100 cores available, so if new resources are available it will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel).

  3. Now the join is done. In the next stage you have only 50 partitions, so to process them in parallel you don't need 40 executors; 10 are enough (10 executors x 5 cores = 50 tasks in parallel). If this stage takes long enough, Spark can free some resources, and you will see removed executors (which then show up as dead in the Executors tab).

  4. Now the next stage involves another repartitioning, and the number of partitions equals 200 again. With 10 executors you can process only 50 partitions in parallel, so Spark will try to acquire new executors... (see the sketch after this list).
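To make the arithmetic above concrete, here is a minimal sketch of the executor-count calculation. The task counts and the 5 cores per executor are just the example numbers from this list; Spark's allocation manager does this internally and does not expose such a function:

```scala
// Minimal sketch of the sizing arithmetic from the example above.
// All numbers are illustrative; Spark's ExecutorAllocationManager performs
// this calculation internally based on pending and running tasks.
object ExecutorSizing {
  def executorsNeeded(tasksInStage: Int, coresPerExecutor: Int): Int =
    math.ceil(tasksInStage.toDouble / coresPerExecutor).toInt

  def main(args: Array[String]): Unit = {
    val coresPerExecutor = 5
    println(executorsNeeded(100, coresPerExecutor)) // stage 1: 100 read tasks    -> 20 executors
    println(executorsNeeded(200, coresPerExecutor)) // stage 2: 200 shuffle tasks -> 40 executors
    println(executorsNeeded(50, coresPerExecutor))  // stage 3: 50 tasks          -> 10 executors, the rest can be released
  }
}
```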

You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

The problem with the spark.dynamicAllocation.enabled property is that it requires you to set subproperties. Some example subproperties are spark.dynamicAllocation.initialExecutors, minExecutors, and maxExecutors. In most cases, subproperties are required to use the right number of executors in a cluster for an application, especially when you need multiple applications to run simultaneously. Setting subproperties requires a lot of trial and error to get the numbers right. If they're not right, capacity might be reserved but never actually used. This leads to wasted resources or memory errors for other applications.
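For reference, here is a minimal sketch of how these subproperties could be set when building the session. The concrete values are placeholders you would need to tune for your cluster, and on EMR dynamic allocation and the external shuffle service are typically enabled by default:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the values below are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  // An external shuffle service (or shuffle tracking on Spark 3.x) lets
  // executors be removed without losing their shuffle files.
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "5")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "40")
  // Idle executors are released after this timeout and then appear as
  // dead in the Executors tab.
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()
```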

There you will find some hints. From my experience, it is worth setting maxExecutors if you are going to run a few jobs in parallel on the same cluster, as most of the time it is not worth starving other jobs just to get 100% efficiency from one job.
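As a rough illustration of that trade-off, here is a small sketch of how a per-application maxExecutors cap could be derived from cluster capacity; the cluster size, cores per executor, and the even split between concurrent jobs are all assumptions for the example, not a general rule:

```scala
// Sketch: derive a per-application maxExecutors cap so that several
// applications sharing the cluster each get a reasonable slice.
// All numbers are illustrative assumptions.
object MaxExecutorsCap {
  def main(args: Array[String]): Unit = {
    val clusterCores     = 200 // total vCPUs the cluster manager can hand out
    val coresPerExecutor = 5   // spark.executor.cores
    val concurrentJobs   = 4   // applications expected to run at the same time

    val totalExecutors = clusterCores / coresPerExecutor              // 40
    val capPerJob      = math.max(1, totalExecutors / concurrentJobs) // 10

    println(s"--conf spark.dynamicAllocation.maxExecutors=$capPerJob")
  }
}
```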
