
Why is Spark not using all cores on local machine

When I run some of the Apache Spark examples in the Spark-Shell or as a job, I am not able to achieve full core utilization on a single machine. For example:

var textColumn = sc.textFile("/home/someuser/largefile.txt").cache()
var distinctWordCount = textColumn.flatMap(line => line.split('\0'))
                             .map(word => (word, 1))
                             .reduceByKey(_+_)
                             .count()

When running this script, I mostly see only 1 or 2 active cores on my 8 core machine. Isn't Spark supposed to parallelise this?

You can run Spark locally with local[*], which starts as many worker threads as there are cores on your machine.
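For example, here is a minimal sketch of a standalone program that sets this when building its own SparkContext (the app name is just a placeholder; in spark-shell the sc is created for you, so there you set the master through the shell's options or the MASTER variable instead):

import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" means: run locally with one worker thread per available core
val conf = new SparkConf()
  .setAppName("DistinctWordCount")   // placeholder app name
  .setMaster("local[*]")
val sc = new SparkContext(conf)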

When you run a local Spark shell, you still have to specify the number of cores that your Spark tasks will use. If you want to use 8 cores, make sure you

export MASTER=local[8]

before running your shell.
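Equivalently, the master URL can be passed directly when launching the shell with the --master option:

spark-shell --master local[8]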

Also, as @zsxwing says, you may need to ensure that your data is partitioned into enough partitions to keep all of the cores busy, or that you specify the amount of parallelism you want to see.
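For instance, a rough sketch of requesting more partitions for the word-count example above (the file path and the partition count of 8 are illustrative):

// Ask for at least 8 input partitions so every core gets work
val textColumn = sc.textFile("/home/someuser/largefile.txt", 8).cache()

// Or repartition an RDD that has already been created
val moreParallel = textColumn.repartition(8)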
