
How do I make my Spark job run faster using executors?

I know my code is free from antipatterns since I don't have any warnings in my Authoring code editor, so I know my code is doing PySpark operations that are distributed and scalable.

My current job has 2 executors assigned to it with 2 cores each, and it runs with task parallelism of 16 as seen on the Spark Details page.

How do I make this job run faster?

Your Executors are the pieces of Spark infrastructure assigned to 'execute' your work. As such, the more of these 'workers' you have, the more work you are able to do in parallel and the faster your job will be.

There's a limit to how much your job will speed up, however, and that limit is a function of the maximum number of tasks in your stages.

For instance, if your data scale is such that you only ever have a maximum of 8 tasks (let's assume AQE is controlling this), assigning enough executors to run more than 8 tasks at once will waste resources and won't increase your job speed.

The job defaults in most Foundry environments are 2 executors with 2 cores each, and 1 core per task. This means your job is capable of running 4 cores at a time, which means 4 tasks.
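
Expressed as standard Spark properties, those defaults look like the minimal sketch below. This assumes a plain PySpark session; in Foundry these values normally come from Spark profiles rather than from code.

from pyspark.sql import SparkSession

# Sketch of the defaults described above, written as standard Spark properties.
spark = (
    SparkSession.builder
    .config("spark.executor.instances", "2")  # 2 executors
    .config("spark.executor.cores", "2")      # 2 cores per executor
    .config("spark.task.cpus", "1")           # 1 core per task (Spark's default)
    .getOrCreate()
)

# Parallel task slots = executors * cores per executor / cores per task
#                     = 2 * 2 / 1 = 4 tasks at a time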

So, if your maximum task count per stage is 4, you won't benefit from boosting your number of executors. If, however, you observe that your stages have, for instance, 16 tasks, then you can choose to increase the number of executors in your job as follows:

16 max tasks, 1 core per task -> 16 cores needed.
2 cores per executor -> 8 executors max.

We could therefore bump this example job up to 8 executors for maximum performance.
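
As a quick sanity check, here is the same sizing arithmetic in Python, using the numbers from the worked example above:

max_tasks_per_stage = 16   # largest task count seen on the Spark Details page
cores_per_task = 1         # Foundry default (spark.task.cpus)
cores_per_executor = 2     # Foundry default (spark.executor.cores)

cores_needed = max_tasks_per_stage * cores_per_task       # 16
executors_needed = cores_needed // cores_per_executor     # 8
print(executors_needed)    # 8 -- more executors than this would sit idle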

For the original question, you would bump the number of executors to 8 for maximum performance. Any further increase in executor count would go unused and would waste resources.
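
How you apply that change depends on where the job runs. Below is a hedged sketch for a Foundry Python transform; the profile name NUM_EXECUTORS_8 and the dataset paths are assumptions, so check which Spark profiles your environment actually exposes. Outside Foundry, the equivalent would be spark-submit flags such as --num-executors 8 --executor-cores 2.

from transforms.api import configure, transform_df, Input, Output

# Sketch only: the profile name below is an assumption; use whatever
# executor-count profile is enabled in your Foundry environment.
@configure(profile=["NUM_EXECUTORS_8"])
@transform_df(
    Output("/Project/datasets/output"),        # hypothetical paths
    source=Input("/Project/datasets/input"),
)
def compute(source):
    # The transform body is unchanged; only the Spark profile changes.
    return source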

