
Cannot reach better speedup in Apache Spark for some small datasets when increasing the number of workers

I use Spark to classify the Census-income dataset with the Random Forest algorithm. My workers (w1 & w2) each have one CPU (4 cores). When I configure Spark to use only w1 as the worker, it takes about 12 seconds to construct the model. When I configure Spark to use both w1 & w2 as workers, it again takes about 12 seconds to construct the model. I can see with 'htop' that both workers' CPU usage goes high when I run the code. However, I expected a lower execution time. When I use larger datasets, I do reach a lower execution time.

Here is my code snippet:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/Census-income.libsvm")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println(s"Test Error = $testErr")

What is the problem? Any comments are appreciated.

Amdahl's Law and Gunther's Law both apply here.

Amdahl's Law essentially says that the speedup from parallelization cannot exceed the inverse of the fraction of the work that is non-parallelizable (relative to the total work). If half of the work is parallelizable and half isn't, then adding an infinite number of workers could at most make the parallelizable half instantaneous, but the non-parallelizable half would be unchanged: you'd thus at best roughly double the speed.
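As a rough back-of-the-envelope sketch (plain Scala; the function name, the parallel fraction p, and the worker counts below are illustrative assumptions, not measurements from this job):

// Amdahl's Law: upper bound on speedup with n workers when a fraction p
// of the work is parallelizable.
def amdahlSpeedup(p: Double, n: Int): Double = 1.0 / ((1.0 - p) + p / n)

// If only half the job is parallelizable, two workers give ~1.33x and even
// a hundred workers barely reach 2x:
// amdahlSpeedup(0.5, 2)   ≈ 1.33
// amdahlSpeedup(0.5, 100) ≈ 1.98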

Gunther's Law (the Universal Scalability Law) in turn implies that each added worker imposes some extra contention and coherency delay: adding a worker only improves performance if the gain from that worker exceeds this non-zero penalty.
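A minimal sketch of that law's shape (again plain Scala; the coefficients alpha and beta are made-up illustrative values, not numbers measured for Spark):

// Gunther's Universal Scalability Law: relative throughput with n workers,
// where alpha models contention and beta models coherency delay.
def uslThroughput(n: Int, alpha: Double, beta: Double): Double =
  n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))

// With these illustrative coefficients, throughput peaks and then falls as
// workers are added:
// uslThroughput(2, 0.1, 0.05) ≈ 1.67
// uslThroughput(4, 0.1, 0.05) ≈ 2.11
// uslThroughput(8, 0.1, 0.05) ≈ 1.78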

In a Spark job, there's a nonzero amount of non-parallelizable work (e.g. setting up the driver for the job). The amount of parallelizable work is bounded by the size of the dataset: the smaller the dataset, the larger the share of the workload that is non-parallelizable.
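One quick way to see how little parallel work a small dataset leaves (a sketch reusing the `data` RDD from the question; the partition count you actually get depends on how the file is loaded and on your configuration):

// How many partitions does the small input produce? With only a handful of
// partitions, a second worker has little or nothing to schedule.
println(s"input partitions = ${data.getNumPartitions}")

// Repartitioning spreads the records across more tasks, but for a dataset
// this small the shuffle cost may outweigh any gain:
val spread = data.repartition(sc.defaultParallelism)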

Additionally, some operations done in the Spark job impose contention and coherency delays, further reducing the benefit of adding more workers (I have no experience of whether Spark MLlib performs such operations).
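If you want to see how much of the 12 seconds is fixed overhead versus training, a crude wall-clock measurement around the question's trainClassifier call is enough (a sketch; the names t0, timedModel and trainSeconds are just for illustration, and timings like this are noisy):

// Time only the training step, separating it from data loading and job setup.
val t0 = System.nanoTime()
val timedModel = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
val trainSeconds = (System.nanoTime() - t0) / 1e9
println(s"training took $trainSeconds s")

If training itself turns out to be only a small slice of the 12 seconds, adding workers cannot change the total much, exactly as Amdahl's Law predicts.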
