Spark multiclass logistic regression class number and labels
I am running Spark's logistic regression example from here for Scala.
In the training part:
val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)
the number of classes is set to 10. If my data consists of 3 labels, which are 5, 12 and 20, it raises an exception such as
ERROR DataValidators: Classification labels should be in {0 to 9}. Found 6 invalid labels.
I know that I can resolve it by setting classnum larger than the largest class value.
Is it possible to run this algorithm with the true number of classes on such a dataset, without making an explicit transformation on the label values?
If I run this with a high classnum to make it work, does the algorithm predict non-existent classes, such as 17 in the example above?
I think the best thing you can do is to map your training data and modify each entry, using a Map to exchange your labels for 0.0, 1.0, 2.0, ..., n - 1, where n = number of classes:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(List(
  LabeledPoint(5.0, Vectors.dense(1, 2)),
  LabeledPoint(12.0, Vectors.dense(1, 3)),
  LabeledPoint(20.0, Vectors.dense(-1, 4))))

// Use Double keys throughout so the Map is inferred as Map[Double, Double]
// and lookups by the Double label in LabeledPoint always match.
val map = Map(5.0 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0)

val trainingData = rdd.map {
  case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}

val model = new LogisticRegressionWithLBFGS().setNumClasses(3).run(trainingData)
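One detail left implicit above: after training on the remapped labels, model.predict returns 0.0, 1.0 or 2.0, so you need the inverse of the same map to recover the original labels (and since the model only knows 3 classes, it can never predict a non-existent class like 17). A minimal sketch of that bookkeeping in plain Scala (no Spark needed; the labelMap and inverseMap names are just illustrative):

```scala
// Forward map used for training: original label -> contiguous class index.
val labelMap: Map[Double, Double] = Map(5.0 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0)

// Inverse map: class index predicted by the model -> original label.
val inverseMap: Map[Double, Double] = labelMap.map(_.swap)

// e.g. a raw prediction of 2.0 from model.predict corresponds to label 20.
val prediction = 2.0
val originalLabel = inverseMap(prediction)
println(originalLabel) // 20.0
```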