
Parallelism in spark.mllib

Let's say I have an object data of type Array[RDD] . I want to train an independent machine learning model on each RDD in this array. For example, with random forests:

data.map { d => RandomForest.trainRegressor(d, Map[Int, Int](), 2, "auto", "variance", 2, 10) }

When I launch this job with spark-submit --master yarn-client ... , the independent training tasks do not seem to be parallelized across multiple nodes. Almost all the work is done by a single node (namely node 10 here), as can be seen in the screenshot from the application UI:

[screenshot of the application UI]

Addendum

For completeness, the whole code is below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest


object test {
  def main(args: Array[String]) {

    // Note: a master set here overrides the --master flag passed to spark-submit;
    // remove setMaster to let it be chosen at submit time.
    val conf = new SparkConf().setMaster("local").setAppName("test")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Load data
    val rawData = sc.textFile("data/mllib/sample_tree_data.csv")
    val data = rawData.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    val CV_data = (1 to 100).toArray.map { _ =>
      val splits = data.randomSplit(Array(0.7, 0.3))
      splits(0)
    }

    CV_data.map(d => RandomForest.trainClassifier(d, 2, Map[Int, Int](), 2, "sqrt", "gini", 2, 100))

    sc.stop()
    System.exit(0)
  }
}

The problem is that RandomForest.trainClassifier behaves like an action , because it eagerly triggers execution of the RDD computations it depends on. Thus, every call to RandomForest.trainClassifier submits Spark jobs to the cluster and runs them to completion.
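As an aside, when jobs are submitted concurrently from multiple threads within one application, Spark's default FIFO scheduler lets the first job take all available resources, which can serialize the later ones in practice. Spark also offers a fair scheduler that round-robins tasks between jobs; a minimal sketch of enabling it on a SparkConf like the one in the addendum:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("test")
  // With concurrent jobs from multiple threads, the default FIFO scheduler
  // lets the first job grab all resources; FAIR round-robins between jobs.
  .set("spark.scheduler.mode", "FAIR")
```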

Since the map operation on a Scala Array is performed sequentially, you end up executing one trainClassifier job after another. To execute the jobs in parallel, you have to call map on a parallel collection instead. The following code snippet should do the trick:

CV_data.par.map(d => RandomForest.trainClassifier(d, 2, Map[Int, Int](), 2, "sqrt", "gini", 2, 100))
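If parallel collections are not available (in Scala 2.13 they moved to the separate scala-parallel-collections module), the same effect can be achieved with standard-library Futures. A sketch with a hypothetical placeholder workload standing in for the trainClassifier call:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Placeholder for RandomForest.trainClassifier(d, ...): any blocking job.
def trainOne(seed: Int): Int = seed * seed

// Submit all jobs concurrently on the global thread pool, then wait for all.
val futures = (1 to 4).map(i => Future(trainOne(i)))
val models = Await.result(Future.sequence(futures), 60.seconds)
// models: Vector(1, 4, 9, 16)
```

Each Future wraps one training call, so all of them are driven from separate threads concurrently, exactly as with .par.map.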
