
How SPARK_WORKER_CORES setting impacts concurrency in Spark Standalone

I am using a Spark 2.2.0 cluster configured in Standalone mode. The cluster has two octa-core machines. This cluster is used exclusively for Spark jobs and no other process uses it. I have around 8 Spark Streaming apps which run on this cluster.
I explicitly set SPARK_WORKER_CORES (in spark-env.sh) to 8 and allocate one core to each app using the total-executor-cores setting. This config reduces the capability to work in parallel on multiple tasks. If a stage works on a partitioned RDD with 200 partitions, only one task executes at a time. What I wanted Spark to do was to start a separate thread for each job and process them in parallel. But I couldn't find a separate Spark setting to control the number of threads.
So, I decided to experiment and bloated the number of cores (i.e. SPARK_WORKER_CORES in spark-env.sh) to 1000 on each machine. Then I gave 100 cores to each Spark application. I found that Spark started processing 100 partitions in parallel this time, indicating that 100 threads were being used.
I am not sure if this is the correct method of controlling the number of threads used by a Spark job.
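
For context, the per-app cap described above can be expressed in PySpark roughly like this (app name and master URL are placeholders):

```python
from pyspark.sql import SparkSession

# Cap this application at one core via spark.cores.max, which is the
# property that --total-executor-cores sets on spark-submit.
# App name and master URL are placeholders.
spark = (
    SparkSession.builder
    .appName("streaming-app-1")
    .master("spark://master-host:7077")
    .config("spark.cores.max", "1")
    .getOrCreate()
)

# With a single core, a stage over a 200-partition RDD runs its tasks
# strictly one after another.
rdd = spark.sparkContext.parallelize(range(1000), numSlices=200)
print(rdd.count())
```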

You mixed up two things:

  • Cluster manager properties - SPARK_WORKER_CORES - the total number of cores that a worker can offer. Use it to control the fraction of resources that should be used by Spark in total.
  • Application properties - --total-executor-cores / spark.cores.max - the number of cores that the application requests from the cluster manager. Use it to control in-app parallelism.

Only the second one is directly responsible for app parallelism, as long as the first one is not limiting.
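
For example, a minimal sketch (placeholder app name and master URL) of raising in-app parallelism through the application-side property alone, with SPARK_WORKER_CORES left at the machines' real core count:

```python
from pyspark.sql import SparkSession

# Ask the cluster manager for 8 cores for this application; the workers'
# SPARK_WORKER_CORES (set in spark-env.sh) only caps what they can offer.
# App name and master URL are placeholders.
spark = (
    SparkSession.builder
    .appName("parallel-app")
    .master("spark://master-host:7077")
    .config("spark.cores.max", "8")
    .getOrCreate()
)

# In standalone mode defaultParallelism reflects the total cores the app
# actually acquired (or 2, whichever is larger), so up to 8 tasks of a
# stage can run at the same time.
print(spark.sparkContext.defaultParallelism)
```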

Also, CORE in Spark is a synonym for thread. If you:

allocate one core to each app using the total-executor-cores setting.

then you specifically assign a single data processing thread.
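
A rough way to see this in practice, sketched under the same placeholder assumptions (each task just sleeps for a second, so wall-clock time tracks the number of concurrent task threads):

```python
import time
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("core-thread-demo")            # placeholder name
    .master("spark://master-host:7077")     # placeholder master URL
    .config("spark.cores.max", "1")         # try "8" and compare the timings
    .getOrCreate()
)
sc = spark.sparkContext

def sleepy(partition):
    # One second of "work" per task.
    time.sleep(1)
    return partition

start = time.time()
(sc.parallelize(range(40), numSlices=40)
   .mapPartitions(lambda it: sleepy(list(it)))
   .count())
# Roughly 40s with one core and ~5s with eight, because the core count
# is the number of concurrent task threads.
print(f"elapsed: {time.time() - start:.1f}s")
```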
