Spark multiclass logistic regression class number and labels
I am running Spark's logistic regression example from here for Scala.
In the training part:
val model = new LogisticRegressionWithLBFGS().setNumClasses(10).run(training)
the number of classes is set to 10. If my data consists of 3 labels, which are 5, 12 and 20, it raises an exception such as
ERROR DataValidators: Classification labels should be in {0 to 9}. Found 6 invalid labels.
I know that I can resolve it by setting classnum larger than the largest class value.
Is it possible to run this algorithm with the true number of classes on such a dataset, without making an explicit transformation on the label values?
If I run this with a high classnum to make it work, does the algorithm predict non-existent classes, such as 17 in the example above?
I think the best thing you can do is to map your training data and modify each entry, using a Map to exchange your labels for 0.0, 1.0, 2.0, ..., n - 1, where n = number of classes:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
val rdd = sc.parallelize(List(
  LabeledPoint(5.0, Vectors.dense(1, 2)),
  LabeledPoint(12.0, Vectors.dense(1, 3)),
  LabeledPoint(20.0, Vectors.dense(-1, 4))))

// Use Double keys throughout so the Map is inferred as Map[Double, Double]
// and lookups by the Double label in LabeledPoint always match.
val map = Map(5.0 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0)

val trainingData = rdd.map {
  case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}

val model = new LogisticRegressionWithLBFGS().setNumClasses(3).run(trainingData)
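One detail left implicit above: after training on the remapped labels, model.predict returns 0.0, 1.0 or 2.0, so you need the inverse of the same map to recover the original labels (and since the model only knows 3 classes, it can never predict a non-existent class like 17). A minimal sketch of that bookkeeping in plain Scala (no Spark needed; the labelMap and inverseMap names are just illustrative):

```scala
// Forward map used for training: original label -> contiguous class index.
val labelMap: Map[Double, Double] = Map(5.0 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0)

// Inverse map: class index predicted by the model -> original label.
val inverseMap: Map[Double, Double] = labelMap.map(_.swap)

// e.g. a raw prediction of 2.0 from model.predict corresponds to label 20.
val prediction = 2.0
val originalLabel = inverseMap(prediction)
println(originalLabel) // 20.0
```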