
Cannot reach a better speedup in Apache Spark for some small datasets when increasing the number of workers

I use Spark to classify the Census-income dataset with the Random Forest algorithm. Each of my workers (w1 and w2) has one CPU (4 cores). When I configure Spark to use only w1 as the worker, it takes about 12 seconds to construct the model. When I configure Spark to use both w1 and w2 as workers, it again takes 12 seconds to construct the model. I can see with 'htop' that both workers' CPU usage goes high when I run the code, yet I expected a lower execution time. When I use larger datasets, I do get a lower execution time.

Here is my code snippet:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/Census-income.libsvm")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println(s"Test Error = $testErr")

What is the problem? Any comments are appreciated.

Amdahl's Law and Gunther's Law both apply here.

Amdahl's Law essentially says that the speedup from parallelization cannot exceed the inverse of the fraction of the work that is non-parallelizable (relative to the total work). If half of the work is parallelizable and half is not, then adding an infinite number of workers could at most make the parallelizable half instantaneous, but the non-parallelizable half would be unchanged: you'd thus at best roughly double the speed.
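
In symbols (the standard statement of the law, nothing Spark-specific), with p the parallelizable fraction of the work and N the number of workers, the achievable speedup is

$$ S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} $$

so with p = 0.5 the limit is 2: at best double the speed, exactly as described above.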

Gunther's Law in turn implies that each added worker imposes some extra contention and coherency delays: adding workers only improves performance if the gain from the added worker exceeds this non-zero performance penalty.
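
A common formulation of Gunther's Universal Scalability Law (quoted here for reference; the parameter values for a particular Spark job would have to be measured) is

$$ S(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)} $$

where α captures contention and β captures coherency delay. With β > 0, S(N) eventually decreases as N grows, so adding workers can even make things slower.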

In a Spark job, there's a nonzero amount of non-parallelizable work (e.g. setting up the driver for the job). The amount of parallelizable work is bounded by the size of the dataset: the smaller the dataset, the larger the share of the workload that is non-parallelizable.
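
One concrete way to see this in your setup (a hypothetical diagnostic, not something taken from the question): with a small libsvm file, the loaded RDD may hold only one or two partitions, i.e. only one or two tasks' worth of parallelizable work per stage, no matter how many workers are attached.

// Hypothetical check, assuming the same spark-shell session and imports as
// the question's snippet: see how many partitions the small dataset produces,
// and hence how many parallel tasks each stage can actually run.
val censusData = MLUtils.loadLibSVMFile(sc, "data/Census-income.libsvm")
println(s"Partitions: ${censusData.getNumPartitions}")

// Repartitioning spreads the (small) parallelizable work over more tasks,
// but the fixed per-job driver overhead stays the same.
val spread = censusData.repartition(sc.defaultParallelism)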

Additionally, some operations done in the Spark job are going to impose contention and coherency delays, further reducing the benefit of adding more workers (I have no experience with whether Spark MLLib performs such operations).
