Let's say I have an object data of type Array[RDD]. I want to learn independent machine learning models on each RDD in this object. For example, with random forests:
data.map { d => RandomForest.trainClassifier(d, 2, Map[Int, Int](), 2, "auto", "gini", 2, 10) }
When I launch this job with spark-submit --master yarn-client ..., the independent learning tasks do not seem to be parallelized across multiple nodes. Almost all of the work is done by a single node (namely node 10 here), as can be seen in the screenshot from the application UI.
Addendum
For completeness, the whole code is below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

object test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Load data
    val rawData = sc.textFile("data/mllib/sample_tree_data.csv")
    val data = rawData.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    val CV_data = (1 to 100).toArray.map { _ =>
      val splits = data.randomSplit(Array(0.7, 0.3))
      splits(0)
    }

    CV_data.map(d => RandomForest.trainClassifier(d, 2, Map[Int, Int](), 2, "sqrt", "gini", 2, 100))

    sc.stop()
    System.exit(0)
  }
}
The problem is that RandomForest.trainClassifier can be seen as an action, because it eagerly triggers the execution of some of the involved RDD calculations. Thus, whenever you call RandomForest.trainClassifier, Spark jobs will be submitted to the cluster and executed.
Since the map operation on a Scala Array is performed sequentially, and each call blocks until its jobs have finished, you end up executing one trainClassifier job after another. In order to execute the jobs in parallel, you have to call map on a parallel collection. The following code snippet should do the trick:
CV_data.par.map(d => RandomForest.trainClassifier(d, 2, Map[Int, Int](), 2, "sqrt", "gini", 2, 100))
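The same idea, launching the blocking calls from several driver threads, can also be sketched with scala.concurrent.Future instead of a parallel collection. This is only an illustration under stated assumptions: train is a hypothetical stand-in for a blocking call such as RandomForest.trainClassifier, and ParallelJobsSketch and trainAll are names made up for this example, not part of Spark.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelJobsSketch {
  // Hypothetical stand-in for a blocking call such as RandomForest.trainClassifier.
  def train(id: Int): String = {
    Thread.sleep(200) // simulate a job that blocks the calling thread
    s"model-$id"
  }

  // Start every call as a Future on the global thread pool, then wait for all of them.
  // Future.sequence keeps the results in input order.
  def trainAll(ids: Seq[Int]): Seq[String] = {
    val futures = ids.map(id => Future(train(id)))
    Await.result(Future.sequence(futures), 1.minute)
  }

  def main(args: Array[String]): Unit = {
    println(trainAll(1 to 4).mkString(","))
  }
}
```

In a real job, train would be replaced by the actual training call on each RDD. How many calls run at once is bounded by the size of the thread pool, and how much the cluster actually overlaps them still depends on the resources available to the application.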