
Apache Spark standalone mode: number of cores

I'm trying to understand the basics of Spark internals. The Spark documentation for submitting applications in local mode says, for the spark-submit --master setting:

local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

local[*] Run Spark locally with as many worker threads as logical cores on your machine.
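For example, the same master setting can also be given programmatically when the SparkSession is built. A minimal Scala sketch (the application name "MyApp" is just a placeholder):

import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MyApp")
      .master("local[4]")   // 4 worker threads; "local[*]" would use all logical cores
      .getOrCreate()

    // ... job logic ...

    spark.stop()
  }
}

The same effect can be had without touching the code by passing --master "local[4]" to spark-submit.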

Since all the data is stored on a single local machine, it does not benefit from distributed operations on RDDs.

How does it benefit, and what is going on internally, when Spark utilizes several logical cores?

Spark will allocate additional worker threads for processing data. Despite being limited to a single machine, it can still take advantage of the high degree of parallelism available in modern servers.
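To make this concrete, here is a rough Scala sketch that counts how many distinct JVM threads actually execute the tasks. It assumes a SparkSession named spark created with .master("local[4]") as above; the short sleep is only there so that tasks overlap in time:

val sc = spark.sparkContext

val threadIds = sc.parallelize(1 to 100, numSlices = 100)
  .map { _ =>
    Thread.sleep(10)                 // keep each task alive long enough to overlap
    Thread.currentThread().getId     // record which worker thread ran this task
  }
  .distinct()
  .collect()

println(s"Tasks ran on ${threadIds.length} distinct worker threads")

With local[4] you should typically see at most 4 distinct thread IDs, one per worker thread in the local executor's thread pool.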

If you have a reasonably sized data set, say one with a dozen partitions, you can measure the time it takes to run with local[1] vs local[n] (where n is the number of cores in your machine). You can also see the difference in the utilization of your machine. If you only have one core designated for use, it will only use 100% of one core (plus some extra for garbage collection). If you have 4 cores and specify local[4], it will use 400% CPU (all 4 cores), and execution time can be significantly shortened (although typically not by 4x).
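A rough Scala sketch of such a measurement is below. The partition count (12) and the amount of work per element are illustrative values, and in practice you might prefer two separate spark-submit runs rather than re-creating the session inside one JVM:

import org.apache.spark.sql.SparkSession

def timeJob(master: String): Long = {
  val spark = SparkSession.builder().appName("CoreBench").master(master).getOrCreate()
  val sc = spark.sparkContext

  val start = System.nanoTime()
  sc.parallelize(1L to 12000000L, numSlices = 12)   // a dozen partitions
    .map(x => math.sqrt(x.toDouble))                // some CPU-bound work per element
    .reduce(_ + _)
  val elapsedMs = (System.nanoTime() - start) / 1000000

  spark.stop()
  elapsedMs
}

println(s"local[1]: ${timeJob("local[1]")} ms")
println(s"local[4]: ${timeJob("local[4]")} ms")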


 