
How to deal with more than one categorical feature in a decision tree?

I read a piece of code about a binary decision tree from a book. The raw data has only one categorical feature, field(3), which is converted to one-of-k (one-hot) encoding.

def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {

  // Read the file and drop the header line from the first partition.
  val rawDataWithHeader = sc.textFile("data/train.tsv")
  val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  val lines = rawData.map(_.split("\t"))

  // Map each distinct value of field(3) to a column index for one-hot encoding.
  val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
  val labelpointRDD = lines.map { fields =>
     val trFields = fields.map(_.replaceAll("\"", ""))
     val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
     val categoryIdx = categoriesMap(fields(3))
     categoryFeaturesArray(categoryIdx) = 1
     // "?" marks a missing value; replace it with 0.0.
     val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
     val label = trFields(fields.size - 1).toInt
     LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
  }

  // Split into 80% / 10% / 10%.
  val Array(trainData, validationData, testData) = labelpointRDD.randomSplit(Array(8, 1, 1))
  (trainData, validationData, testData, categoriesMap)
}

I wonder how to revise the code if there are several categorical features in the raw data, say field(3), field(5), and field(7) are all categorical.

I revised the first line:

def PrepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int]) = ...

Then I converted the other two fields into 1-of-k encoding the same way:

val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
val categoryIdx5 = categoriesMap5(fields(5))
val categoryIdx7 = categoriesMap7(fields(7))
categoryFeaturesArray5(categoryIdx5) = 1
categoryFeaturesArray7(categoryIdx7) = 1

Finally, I revised the LabeledPoint and the return value like this:

LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)

Is it correct?

==================================================

The second problem I encountered is with the following code from that book. In trainModel it calls

  DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

Here is the code:

def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
   val startTime = new DateTime()
   // 2 classes; the empty Map means "treat every feature as numerical".
   val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
   val endTime = new DateTime()
   val duration = new Duration(startTime, endTime)
   (model, duration.getMillis())
}

The question is: how do I pass the categoricalFeaturesInfo into this method if it has three categorical features mentioned previously?

I just want to follow the steps in the book to build a prediction system of my own using a decision tree. To be more specific, the data sets I chose have several categorical features, such as:

  • Gender: male, female
  • Education: HS-grad, Bachelors, Master, PH.D, ......
  • Country: US, Canada, England, Australia, ......

But I don't know how to merge them into one single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and one single categoricalFeaturesInfo to put into DecisionTree.trainClassifier().

It is not clear to me what exactly you're doing here, but it looks like it is wrong from the beginning.

Ignoring the fact that you're reinventing the wheel by implementing one-hot encoding from scratch, the whole point of such encoding is to convert categorical variables into numerical ones. This is required for linear models, but it arguably doesn't make sense when working with decision trees.
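As a tiny illustration of what one-of-k encoding does (the category map below is made up), a single categorical value becomes a k-element numeric array:

```scala
// Hypothetical value-to-index map, as built from the distinct values of a column.
val educationMap = Map("HS-grad" -> 0, "Bachelors" -> 1, "Masters" -> 2)

// One-of-k: all zeros except a 1.0 at the value's index.
def oneHot(value: String): Array[Double] = {
  val arr = Array.ofDim[Double](educationMap.size)
  arr(educationMap(value)) = 1.0
  arr
}
```

Here `oneHot("Bachelors")` yields `Array(0.0, 1.0, 0.0)`: three columns where a tree would be happy with one indexed column.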

Keeping that in mind you have two choices:

  • Index categorical fields without encoding them, and describe the indexed features via categoricalFeaturesInfo.
  • One-hot-encode categorical features and treat them as numerical variables.
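The former option can be sketched without Spark; below, plain Scala collections stand in for the RDD, and all field positions and values are hypothetical:

```scala
object IndexCategoricals {
  // Build one index map per categorical column: value -> category index.
  def buildIndex(rows: Seq[Array[String]], col: Int): Map[String, Int] =
    rows.map(_(col)).distinct.zipWithIndex.toMap

  // Each categorical value becomes a single Double index (no one-hot array);
  // numeric columns are parsed as-is and appended after the categorical ones.
  def toFeatures(rows: Seq[Array[String]],
                 catCols: Seq[Int],
                 numCols: Seq[Int]): (Seq[Array[Double]], Map[Int, Int]) = {
    val indexes = catCols.map(c => c -> buildIndex(rows, c)).toMap
    val features = rows.map { r =>
      (catCols.map(c => indexes(c)(r(c)).toDouble) ++
       numCols.map(c => r(c).toDouble)).toArray
    }
    // categoricalFeaturesInfo: feature position -> number of distinct categories.
    val info = catCols.indices.map(i => i -> indexes(catCols(i)).size).toMap
    (features, info)
  }
}
```

The `info` map is exactly the shape `DecisionTree.trainClassifier` expects as `categoricalFeaturesInfo` (with the categorical features at positions 0, 1, 2, ...); remember that `maxBins` must be at least as large as the largest category count.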

I believe the former is the right approach. The latter should work in practice, but it artificially increases dimensionality without providing any benefit. It may also conflict with some of the heuristics used by the Spark implementation.

One way or another, you should consider using ML Pipelines, which provide all the required indexing, encoding, and merging tools.
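A rough sketch of the Pipelines route (all column names here are made up, and the data is assumed to be loaded as a DataFrame with string columns for the categorical fields and a numeric "age" column):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// One StringIndexer per categorical column: learns value -> index at fit() time.
val indexers = Seq("gender", "education", "country").map { col =>
  new StringIndexer().setInputCol(col).setOutputCol(s"${col}_idx")
}

// VectorAssembler merges the indexed columns and the numeric ones
// into the single "features" vector column that tree learners consume.
val assembler = new VectorAssembler()
  .setInputCols(Array("gender_idx", "education_idx", "country_idx", "age"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages((indexers :+ assembler).toArray)
```

Calling `pipeline.fit(df).transform(df)` then adds the indexed columns and the assembled `features` column, with no hand-rolled maps or array concatenation.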
