I'm trying to run binary classification on a big dataset (5 million rows x 450 features) using the XGBoost Spark library on AWS EMR.
I've attempted many different configurations.
Even though I get different training times, when I analyze the cluster load through Ganglia it is always low. I've been trying to maximize resource usage for faster classification, since I'm running 1000 rounds of XGBoost, but no matter which parameters I set, I always see similar usage.
Here's the EMR setup I'm using:
Master node: 1 m4.xlarge
Worker nodes: 10 m4.2xlarge
Total vCores on workers: 160
Some of the different parameters are here: Different Spark and XGBoost config I've tried
I'm performing 1000 num_round iterations, with 4-fold cross-validation and some hyperparameter tuning (36 possible combinations). Each iteration takes around 1 second, so the training itself will take around 40 hours (36 combinations x 4 folds x 1000 rounds = 144,000 iterations at ~1 s each).
And the cluster load is really low. Cluster usage
Any tips on what I can do to better use my cluster resources and train faster? Is there something I'm missing when setting the number of XGBoost workers, Spark executors, or other configs? Or is there nothing else to do, and this cluster setup is simply overkill for this workload?
1. In XGBoost4J-Spark, each XGBoost worker is wrapped in a Spark task, and the training dataset in Spark's memory is fed to the XGBoost workers transparently to the user.
2. If you want OpenMP optimization, you have to:
- set nthread to a value larger than 1 when creating XGBoostClassifier/XGBoostRegressor;
- set spark.task.cpus in Spark to the same value as nthread.
A number of parameters need to be set correctly for training to take advantage of all resources. Configuration example:
val spark = SparkSession
.builder()
.appName("xgboost on spark, baseline of cross validation")
.master("yarn")
.config("spark.sql.warehouse.dir", warehouse_location)
.config("spark.executor.instances", "10")
.config("spark.executor.memory", "40g")
.config("spark.executor.cores", "8")
.config("spark.driver.memory", "30g")
.config("spark.driver.cores", "4")
.config("spark.task.cpus","4")
.enableHiveSupport()
.getOrCreate();
val booster = new XGBoostClassifier(
Map("eta" -> 0.1f,
"max_depth" -> 7,
"objective" -> "multi:softprob",
"num_round" -> 15,
"num_workers" -> 20,
"eval_metric" -> "mlogloss",
"num_class" -> num_class.toInt
)
)
//booster.setNumClass(num_class.toInt)
booster.setGamma(0.5)
booster.setNthread(4)
The configuration above runs 20 XGBoost workers with 4 threads of parallelism each: 10 executors x 8 cores = 80 cores, and with spark.task.cpus = 4 that allows 80 / 4 = 20 concurrent tasks, matching num_workers = 20.
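The sizing rule above can be sketched as a quick sanity check before launching a job. This is just a back-of-the-envelope calculation using the numbers from the example configuration; the variable names are my own, not part of any XGBoost4J-Spark API:

```scala
// Sizing sketch: the number of XGBoost workers that can run concurrently
// is bounded by (executors * cores per executor) / spark.task.cpus.
object ClusterSizing {
  def main(args: Array[String]): Unit = {
    val executorInstances = 10 // spark.executor.instances
    val executorCores     = 8  // spark.executor.cores
    val nthread           = 4  // threads per XGBoost worker; must equal spark.task.cpus

    val totalCores = executorInstances * executorCores // 80 cores available
    val maxWorkers = totalCores / nthread              // 20 concurrent tasks

    // If num_workers exceeds maxWorkers, the extra XGBoost workers wait in
    // the YARN queue and training stalls, since all workers must start
    // before a distributed training round can begin.
    println(s"Set num_workers <= $maxWorkers when nthread = $nthread")
  }
}
```

If the cluster load still looks low with these numbers matched up, the next things to check are whether the data is repartitioned to num_workers partitions and whether cross-validation is running candidate models sequentially rather than in parallel.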