
How do I make my Spark job run faster using executors?

I know my code is free from antipatterns since I don't have any warnings in my Authoring code editor, so I know my code is doing PySpark operations that are distributed and scalable.

My current job has 2 executors assigned to it with 2 cores each, and it runs with task parallelism of 16 as seen on the Spark Details page.

How do I make this job run faster?

Your Executors are the pieces of Spark infrastructure assigned to 'execute' your work. As such, the more of these 'workers' you have, the more work you are able to do in parallel and the faster your job will be.

There's a limit to how much your job will speed up, however, and that limit is a function of the maximum number of tasks in your stages.

For instance, if your data scale is such that you only ever have a maximum of 8 tasks (let's assume AQE is controlling this), assigning enough executors to run more than 8 tasks at once will waste resources and won't increase your job speed.

The job defaults in most Foundry environments are 2 executors with 2 cores each, and 1 core per task. This means your job is capable of running 4 cores at a time, which means 4 tasks.
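
Expressed as standard Spark properties, those defaults look like the minimal sketch below. This assumes a plain PySpark session; in Foundry these values normally come from Spark profiles rather than from code.

from pyspark.sql import SparkSession

# Sketch of the defaults described above, written as standard Spark properties.
spark = (
    SparkSession.builder
    .config("spark.executor.instances", "2")  # 2 executors
    .config("spark.executor.cores", "2")      # 2 cores per executor
    .config("spark.task.cpus", "1")           # 1 core per task (Spark's default)
    .getOrCreate()
)

# Parallel task slots = executors * cores per executor / cores per task
#                     = 2 * 2 / 1 = 4 tasks at a time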

So, if your maximum task count per stage is 4, you won't benefit from boosting your number of executors. If, however, you observe that your stages have, for instance, 16 tasks, then you can choose to increase the number of executors in your job as follows:

16 max tasks, 1 core per task -> 16 cores needed.
2 cores per executor -> 8 executors max.

We could therefore bump this example job up to 8 executors for maximum performance.
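
As a quick sanity check, here is the same sizing arithmetic in Python, using the numbers from the worked example above:

max_tasks_per_stage = 16   # largest task count seen on the Spark Details page
cores_per_task = 1         # Foundry default (spark.task.cpus)
cores_per_executor = 2     # Foundry default (spark.executor.cores)

cores_needed = max_tasks_per_stage * cores_per_task       # 16
executors_needed = cores_needed // cores_per_executor     # 8
print(executors_needed)    # 8 -- more executors than this would sit idle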

For the original question, you would bump the number of executors to 8 for maximum performance. Any further increase in executor count would go unused and would waste resources.
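
How you apply that change depends on where the job runs. Below is a hedged sketch for a Foundry Python transform; the profile name NUM_EXECUTORS_8 and the dataset paths are assumptions, so check which Spark profiles your environment actually exposes. Outside Foundry, the equivalent would be spark-submit flags such as --num-executors 8 --executor-cores 2.

from transforms.api import configure, transform_df, Input, Output

# Sketch only: the profile name below is an assumption; use whatever
# executor-count profile is enabled in your Foundry environment.
@configure(profile=["NUM_EXECUTORS_8"])
@transform_df(
    Output("/Project/datasets/output"),        # hypothetical paths
    source=Input("/Project/datasets/input"),
)
def compute(source):
    # The transform body is unchanged; only the Spark profile changes.
    return source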

