
Spark multiclass logistic regression class number and labels

I am running Spark's logistic regression example for Scala from here.

In the training part:

val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)

the number of classes is set to 10. If my data consists of 3 labels, namely 5, 12 and 20, it raises an exception such as

ERROR DataValidators: Classification labels should be in {0 to 9}. Found 6 invalid labels.

I know that I can resolve it by setting the class number larger than the largest label value.
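For example, something like this passes the label validation (a sketch using the same training set as above), since 5, 12 and 20 all fall in {0 to 20}:

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(21) // 21 > 20, so every label lies in {0, ..., 20}
  .run(training)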

Is it possible to run this algorithm with the true number of classes on such a dataset, without an explicit transformation of the label values?

If I run it with a high class number to make it work, does the algorithm predict non-existent classes, such as 17 in the example above?

I think the best thing you can do is to map over your training data and modify each entry, using a Map to exchange your labels for 0.0, 1.0, 2.0, ..., n - 1, where n is the number of classes:

import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(List(
  LabeledPoint(5.0, Vectors.dense(1,2)), 
  LabeledPoint(12.0, Vectors.dense(1,3)),
  LabeledPoint(20.0, Vectors.dense(-1,4))))

val map = Map(5.0 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0) // original label -> 0-based class index

val trainingData = rdd.map{
  case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}

val model = new LogisticRegressionWithLBFGS().setNumClasses(3).run(trainingData)
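Predictions from this model come back as 0.0, 1.0 or 2.0, so to report results in the original label space you can invert the map afterwards. A minimal sketch, reusing the map, trainingData and model values defined above (the inverseMap name is just for illustration):

// Invert the label map to translate the model's 0-based predictions
// back into the original labels
val inverseMap = map.map(_.swap) // Map(0.0 -> 5.0, 1.0 -> 12.0, 2.0 -> 20.0)

val predictions = trainingData.map { point =>
  inverseMap(model.predict(point.features)) // e.g. 0.0 becomes 5.0
}

Note that since the model was trained with setNumClasses(3), predict can only ever return 0.0, 1.0 or 2.0, so after mapping back you will only see 5, 12 or 20, never a non-existent class like 17.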
