
Parallelism in spark.mllib

Let's say I have an object data of type Array[RDD] . I want to train an independent machine learning model on each RDD in this array. For example, with random forests:

data.map { d => RandomForest.trainRegressor(d, Map[Int, Int](), 2, "auto", "variance", 2, 10) }

When I launch this job with spark-submit --master yarn-client ... , the independent training tasks do not seem to be parallelized across multiple nodes. Almost all the work is done by a single node (namely node 10 here), as can be seen in the screenshot from the application UI:

[screenshot of the application UI]

Addendum

For completeness, the whole code is below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest


object test {
  def main(args: Array[String]) {

    // Note: a master set here overrides the --master flag passed to spark-submit;
    // remove setMaster to let it be chosen at submit time.
    val conf = new SparkConf().setMaster("local").setAppName("test")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Load data
    val rawData = sc.textFile("data/mllib/sample_tree_data.csv")
    val data = rawData.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    val CV_data = (1 to 100).toArray.map { _ =>
      val splits = data.randomSplit(Array(0.7, 0.3))
      splits(0)
    }

    CV_data.map(d => RandomForest.trainClassifier(d, 2, Map[Int, Int](), 2, "sqrt", "gini", 2, 100))

    sc.stop()
    System.exit(0)
  }
}

The problem is that RandomForest.trainClassifier behaves like an action , because it eagerly triggers execution of the RDD computations it depends on. Thus, every call to RandomForest.trainClassifier submits Spark jobs to the cluster and runs them to completion.
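As an aside, when jobs are submitted concurrently from multiple threads within one application, Spark's default FIFO scheduler lets the first job take all available resources, which can serialize the later ones in practice. Spark also offers a fair scheduler that round-robins tasks between jobs; a minimal sketch of enabling it on a SparkConf like the one in the addendum:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("test")
  // With concurrent jobs from multiple threads, the default FIFO scheduler
  // lets the first job grab all resources; FAIR round-robins between jobs.
  .set("spark.scheduler.mode", "FAIR")
```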

Since the map operation on a Scala Array is performed sequentially, you end up executing one trainClassifier job after another. To execute the jobs in parallel, you have to call map on a parallel collection instead. The following code snippet should do the trick:

CV_data.par.map(d => RandomForest.trainClassifier(d, 2, Map[Int, Int](), 2, "sqrt", "gini", 2, 100))
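If parallel collections are not available (in Scala 2.13 they moved to the separate scala-parallel-collections module), the same effect can be achieved with standard-library Futures. A sketch with a hypothetical placeholder workload standing in for the trainClassifier call:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Placeholder for RandomForest.trainClassifier(d, ...): any blocking job.
def trainOne(seed: Int): Int = seed * seed

// Submit all jobs concurrently on the global thread pool, then wait for all.
val futures = (1 to 4).map(i => Future(trainOne(i)))
val models = Await.result(Future.sequence(futures), 60.seconds)
// models: Vector(1, 4, 9, 16)
```

Each Future wraps one training call, so all of them are driven from separate threads concurrently, exactly as with .par.map.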
