
Spark performance tuning - number of executors vs number of cores

I have two questions around performance tuning in Spark:

  1. I understand that one of the key things for controlling parallelism in a Spark job is the number of partitions in the RDD being processed, and then controlling the executors and cores that process these partitions. Can I assume this to be true:

    • # of executors * # of executor cores should be <= # of partitions, i.e. one partition is always processed on one core of one executor. There is no point having more executors * cores than the number of partitions.
  2. I understand that having a high number of cores per executor can have a negative impact on things like HDFS writes, but here is my second question, purely from a data-processing point of view: what is the difference between the two? For example, if I have a 10-node cluster, what would be the difference between these two jobs (assuming there is ample memory per node to process everything; a configuration sketch follows after this list):

    1. 5 executors * 2 executor cores

    2. 2 executors * 5 executor cores

    Assuming there's infinite memory and CPU, from a performance point of view should we expect the above two to perform the same?
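To make the comparison concrete, here is a minimal Scala sketch (the application name and input path are assumptions) showing how the two layouts would be configured and how to check the partition count against the executors * cores rule from point 1:

    import org.apache.spark.sql.SparkSession

    // Hypothetical job showing the two layouts from the question; both give
    // 10 concurrent task slots in total (executors * cores).
    val spark = SparkSession.builder()
      .appName("executor-layout-demo")
      // Layout 1: 5 executors * 2 cores each
      .config("spark.executor.instances", "5")
      .config("spark.executor.cores", "2")
      // Layout 2 would instead be:
      //   .config("spark.executor.instances", "2")
      //   .config("spark.executor.cores", "5")
      .getOrCreate()

    // Rule of thumb from point 1: executors * cores <= number of partitions,
    // since each partition is processed by exactly one task on one core.
    val rdd = spark.sparkContext.textFile("hdfs:///some/input") // assumed path
    println(s"partitions = ${rdd.getNumPartitions}, task slots = ${5 * 2}")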

Most of the time, using larger executors (more memory, more cores) is better. First, a larger executor with more memory can easily support broadcast joins and do away with the shuffle. Second, since tasks are not created equal, statistically larger executors have a better chance of surviving OOM issues. The only problem with large executors is GC pauses; G1GC helps.
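To illustrate the broadcast-join and G1GC points, here is a hedged Scala sketch; the table sizes and the use of spark.executor.extraJavaOptions to pass the G1GC flag are assumptions about the setup, not a prescription:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder()
      .appName("broadcast-join-demo")
      // Assumed setup: enable G1GC on executors to shorten GC pauses on large heaps.
      .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
      .getOrCreate()
    import spark.implicits._

    val bigDf   = spark.range(100000000L).toDF("id")            // stand-in for a large table
    val smallDf = Seq((1L, "a"), (2L, "b")).toDF("id", "label") // small lookup table

    // broadcast() ships smallDf to every executor, so bigDf is joined in place
    // without a shuffle; this is what large executor memory buys.
    val joined = bigDf.join(broadcast(smallDf), "id")
    joined.show(2)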

In my experience, if I had a cluster with 10 nodes, I would go for 20 Spark executors. The details of the job matter a lot, so some testing will help determine the optimal configuration.
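For completeness, a sketch of what that 10-node, 20-executor starting point could look like; the cores and memory per executor are assumptions that depend on the actual node size:

    import org.apache.spark.sql.SparkSession

    // 20 executors on 10 nodes = 2 executors per node; sizes below are assumed.
    val spark = SparkSession.builder()
      .appName("ten-node-starting-point")
      .config("spark.executor.instances", "20")
      .config("spark.executor.cores", "4")   // assumed; tune to the node's CPUs
      .config("spark.executor.memory", "8g") // assumed; tune to the node's RAM
      .getOrCreate()

Whether two medium executors per node beats one large or four small ones is exactly the kind of detail the testing mentioned above should settle.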
