
Spark multiclass logistic regression class number and labels

I am running Spark's logistic regression example for Scala from here.

In the training part:

val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)

the number of classes is set to 10. If my data consists of 3 labels, namely 5, 12 and 20, it raises an exception such as

ERROR DataValidators: Classification labels should be in {0 to 9}. Found 6 invalid labels.

I know that I can resolve it by setting the class number larger than the largest label value.
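For example, something like this passes the label validation (a sketch using the same training set as above), since 5, 12 and 20 all fall in {0 to 20}:

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(21) // 21 > 20, so every label lies in {0, ..., 20}
  .run(training)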

Is it possible to run this algorithm with the true number of classes on such a dataset, without an explicit transformation of the label values?

If I run it with a high class number to make it work, does the algorithm predict non-existent classes, such as 17 in the example above?

I think the best thing you can do is to map over your training data and modify each entry, using a Map to exchange your labels for 0.0, 1.0, 2.0, ..., n - 1, where n is the number of classes:

import org.apache.spark.mllib.regression.LabeledPoint 
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors

val rdd = sc.parallelize(List(
  LabeledPoint(5.0, Vectors.dense(1,2)), 
  LabeledPoint(12.0, Vectors.dense(1,3)),
  LabeledPoint(20.0, Vectors.dense(-1,4))))

val map = Map(5.0 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0) // original label -> 0-based class index

val trainingData = rdd.map{
  case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}

val model = new LogisticRegressionWithLBFGS().setNumClasses(3).run(trainingData)
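Predictions from this model come back as 0.0, 1.0 or 2.0, so to report results in the original label space you can invert the map afterwards. A minimal sketch, reusing the map, trainingData and model values defined above (the inverseMap name is just for illustration):

// Invert the label map to translate the model's 0-based predictions
// back into the original labels
val inverseMap = map.map(_.swap) // Map(0.0 -> 5.0, 1.0 -> 12.0, 2.0 -> 20.0)

val predictions = trainingData.map { point =>
  inverseMap(model.predict(point.features)) // e.g. 0.0 becomes 5.0
}

Note that since the model was trained with setNumClasses(3), predict can only ever return 0.0, 1.0 or 2.0, so after mapping back you will only see 5, 12 or 20, never a non-existent class like 17.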
