Multiclass classification, show raw predictions better in Scala with Spark
I am working with the Iris dataset (LogisticRegressionWithLBFGS(), multiclass classification). I pulled my data into an RDD, converted it to a DataFrame, and did some tidying up on it. I created a label index on the Iris plant class/label field and a feature vector from the other fields. I then took these two DataFrame columns and converted them into an RDD of LabeledPoint instances, which I can feed into LogisticRegressionWithLBFGS().
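The preparation steps described above could look roughly like the following. This is only a sketch under assumptions: it targets Spark 2.x or later (where `VectorAssembler` emits `ml` vectors that must be converted for the RDD-based API), and the column names `sepal_length`, `sepal_width`, `petal_length`, `petal_width`, and `species`, as well as the helper name `toLabeledPoints`, are hypothetical and not taken from the original post.

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper: column names are assumptions, adjust them to the real DataFrame.
def toLabeledPoints(iris: DataFrame): RDD[LabeledPoint] = {
  // Map the string class (e.g. "Iris-setosa") to a numeric index: 0.0, 1.0, 2.0.
  val indexed = new StringIndexer()
    .setInputCol("species")
    .setOutputCol("label")
    .fit(iris)
    .transform(iris)

  // Pack the four numeric columns into a single vector column.
  val assembled = new VectorAssembler()
    .setInputCols(Array("sepal_length", "sepal_width", "petal_length", "petal_width"))
    .setOutputCol("features")
    .transform(indexed)

  // The RDD-based API expects mllib vectors, so convert from the ml vector type.
  assembled.select("label", "features").rdd.map { row =>
    LabeledPoint(
      row.getAs[Double]("label"),
      OldVectors.fromML(row.getAs[MLVector]("features")))
  }
}
```

The resulting RDD plays the role of `training` in the code that follows.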
Here is some predictor code:
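import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// training: RDD[LabeledPoint]
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10) // Iris has only 3 classes, so 3 would be enough
  .setIntercept(true)
  .setValidateData(true)
  .run(training)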
Scores and labels:
// Pair each prediction with the true label: RDD[(Double, Double)]
val scoreAndLabels_ofTrain = training.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
I wanted to see the predictions:
scoreAndLabels_ofTrain.take(200).foreach(println)
The only problem is that I got this example pretty much straight from a book. I was hoping to see a dataset that shows the feature columns, the predicted number, the probability score it gave, and so on. I imagine I would need to convert the label index back if I wanted to see the string data it represents.
How do I get better-looking, tabular data, as close as possible to the original dataset, with the predictions against it? I think I'm missing a trick here somewhere.
The output of the above looks like:
(2.0,2.0)
(2.0,2.0)
(2.0,2.0)
(2.0,2.0)
(2.0,2.0)
...
What does this even mean? I'm not sure how to read or interpret the data. For the first line, is it saying that it predicted "2.0" and the actual label was "2.0"? Am I understanding it right?
Yes. When you apply the map to the input dataset and make a prediction for each element, what you get is an RDD[(Double, Double)] of (prediction, label) pairs, in the order your map returns them. However, you are using the RDD-based MLlib LR implementation. You can use the DataFrame implementation directly instead. Take a look at the example. The fit function optimizes the model and returns a LogisticRegressionModel. Apply the transform method to your input DataFrame and a new column with the prediction will be added.
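As a rough illustration of the DataFrame route, the sketch below chains a `StringIndexer`, a `VectorAssembler`, a `LogisticRegression`, and an `IndexToString` in a `Pipeline`, then selects the original columns next to the probability vector and the predicted class. It assumes Spark 2.1 or later (multiclass support in the `ml` logistic regression); the column names and the helper name `predictAsTable` are hypothetical and not from the original post.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: column names are assumptions, adjust them to the real DataFrame.
def predictAsTable(iris: DataFrame): DataFrame = {
  val featureCols = Array("sepal_length", "sepal_width", "petal_length", "petal_width")

  // Fitted up front so that its label strings can be reused by IndexToString.
  val indexer = new StringIndexer()
    .setInputCol("species")
    .setOutputCol("label")
    .fit(iris)

  val assembler = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("features")

  // Uses the default column names "label" and "features"; adds the columns
  // "rawPrediction", "probability" and "prediction" on transform.
  val lr = new LogisticRegression()

  // Turn the numeric prediction back into the original class string.
  // On Spark 3.x, indexer.labelsArray(0) is the non-deprecated equivalent.
  val converter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedSpecies")
    .setLabels(indexer.labels)

  val model = new Pipeline()
    .setStages(Array(indexer, assembler, lr, converter))
    .fit(iris)

  // Keep the original columns and put the prediction details next to them.
  val cols = featureCols ++ Array("species", "label", "probability", "prediction", "predictedSpecies")
  model.transform(iris).select(cols.head, cols.tail: _*)
}
```

Calling `predictAsTable(irisDf).show(5, truncate = false)` on a DataFrame with those columns (here `irisDf` is a placeholder name) should print one row per flower with its features, true class, per-class probabilities, and predicted class as a string. Like the code in the question, this predicts on the training data itself; a held-out split would be needed for a fair evaluation.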