
Why is this simple Spark program not utilizing multiple cores?

So, I'm running this simple program on a 16-core machine. I run it by issuing the following command:

spark-submit --master local[*] pi.py

And the code of that program is the following:

#"""pi.py"""
from pyspark import SparkContext
import random

N = 12500000

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, N)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)

When I use top to see CPU consumption, only 1 core is being utilized. Why is that? Secondly, the Spark documentation says that the default parallelism is contained in the property spark.default.parallelism. How can I read this property from within my Python program?
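
A minimal sketch of reading that value from PySpark, assuming sc is the SparkContext created above (the second argument to get is just a fallback for when the property is not set explicitly):

# Read spark.default.parallelism from the active configuration.
print sc.getConf().get("spark.default.parallelism", "not set")

# The effective default parallelism Spark will actually use:
print sc.defaultParallelism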

As none of the above really worked for me (maybe because I didn't really understand them), here are my two cents.

I was starting my job with spark-submit program.py, and inside the file I had sc = SparkContext("local", "Test"). I tried to verify the number of cores Spark sees with sc.defaultParallelism. It turned out to be 1. When I changed the context initialization to sc = SparkContext("local[*]", "Test"), it became 16 (the number of cores on my system) and my program used all the cores.

I am quite new to Spark, but my understanding is that local by default indicates the use of one core, and since it is set inside the program, it overrides the other settings (in my case, at least, it overrode those from the configuration files and environment variables).
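
A minimal sketch of that change (the application name is just the one from the question):

# "local[*]" asks Spark to use as many worker threads as there are cores;
# a master set in code takes precedence over flags and config files.
sc = SparkContext("local[*]", "Test App")
print sc.defaultParallelism   # reported 16 on the 16-core machine above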

Probably because the call to sc.parallelize puts all the data into a single partition. You can specify the number of partitions as the second argument to parallelize:

part = 16
count = sc.parallelize(xrange(N), part).map(sample).reduce(lambda a, b: a + b)

Note that this would still generate the 12.5 million points with one CPU in the driver, and then only spread them out to 16 partitions to perform the reduce step.

A better approach would try to do most of the work after the partitioning: for example, the following generates only a tiny array on the driver, and then lets each remote task generate the actual random numbers and the subsequent pi approximation:

part = 16
count = ( sc.parallelize([0] * part, part)
            .flatMap(lambda blah: [sample(p) for p in xrange(N / part)])
            .reduce(lambda a, b: a + b)
        )
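
The result can then be turned into the estimate exactly as in the original script (a sketch; N is the sample count defined above):

# The fraction of samples falling inside the unit circle approximates pi/4.
print "Pi is roughly %f" % (4.0 * count / N)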

Finally (because the lazier we are the better), Spark MLlib actually already comes with random data generation that is nicely parallelized; have a look here: http://spark.apache.org/docs/1.1.0/mllib-statistics.html#random-data-generation. So maybe the following is close to what you are trying to do (not tested => probably not working, but should hopefully be close):

count = ( RandomRDDs.uniformRDD(sc, N, part)
            .zip(RandomRDDs.uniformRDD(sc, N, part))
            .filter(lambda (x, y): x*x + y*y < 1)
            .count()
        )
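
This snippet assumes the MLlib generator class has been imported first; a sketch of the import (RandomRDDs lives in pyspark.mllib.random):

# RandomRDDs.uniformRDD(sc, size, numPartitions) creates an RDD of
# uniformly distributed doubles spread across the given partitions.
from pyspark.mllib.random import RandomRDDs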

To change the CPU core consumption, set the number of cores to be used by the workers in the spark-env.sh file in spark-installation-directory/conf. This is done with the SPARK_EXECUTOR_CORES attribute in the spark-env.sh file. The value is set to 1 by default.
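
A sketch of that setting in conf/spark-env.sh (16 is just the core count of the machine in the question; this is an executor setting for cluster deployments):

# Number of cores each executor may use (default: 1)
export SPARK_EXECUTOR_CORES=16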

I tried the method mentioned by @Svend, but it still did not work.

The following works for me:

Do NOT use the local URL, for example:

sc = SparkContext("local", "Test App") . sc = SparkContext("local", "Test App")

Use the master URL like this:

sc = SparkContext("spark://your_spark_master_url:port", "Test App")
