Multiclass classification, show raw predictions better in Scala with Spark

I'm working with the Iris dataset (LogisticRegressionWithLBFGS(), multiclass classification). I pulled my data into an RDD, converted it to a DataFrame, and did some tidying up on it. I created a labelindex on the Iris plant class/label field and a feature vector from the other fields. I then took these two fields of the DataFrame and converted them into an RDD of LabeledPoint instances, which I can feed into LogisticRegressionWithLBFGS().

Here is some predictor code:

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .setIntercept(true)
  .setValidateData(true)
  .run(training)

Scores and labels:

val scoreAndLabels_ofTrain = training.map {
  point =>
    val score = model.predict(point.features)
    (score, point.label)
}

I wanted to see the predictions:

scoreAndLabels_ofTrain.take(200).foreach(println)

The only problem is that I pretty much got this example from a book. I was kind of hoping to see a dataset that shows the feature columns, what the predicted number was, what probability score it gave, etc. I'd imagine I'd need to convert the labelindex back if I wanted to see the string data it represents.

How do I get better looking, tabular data, as close as possible to the original dataset, with the predictions alongside? I think I'm missing a trick here somewhere.

The output of the above looks like:

(2.0,2.0)
(2.0,2.0)
(2.0,2.0)
(2.0,2.0)
(2.0,2.0)
...

What does this even mean? I'm not sure how to read/interpret the data. For the first line, is it saying that it predicted "2.0" and the actual label was "2.0"? Am I understanding it right?

Yes. Mapping over the input dataset and predicting for each element gives you an RDD[(Double, Double)]. In your code each tuple is (prediction, label), so the first line says the model predicted class 2.0 and the actual label was 2.0. However, you are using the RDD-based MLlib logistic regression implementation. You can use the DataFrame-based implementation directly. Take a look at the example. The fit function optimizes the model and returns a LogisticRegressionModel. Apply the transform method to your input DataFrame and a new column with the prediction will be added.
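To make that concrete, here is a minimal sketch of the DataFrame-based route. It is not the code from the question: the object name IrisPredictions, the column names, and the nine sample rows are placeholders I made up for illustration, and it assumes Spark 2.1 or later, where the spark.ml LogisticRegression handles more than two classes. An IndexToString stage maps the numeric prediction back to the original string label.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object IrisPredictions {
  // Assumed column names; replace them with those of your tidied DataFrame.
  val featureCols = Array("sepal_length", "sepal_width", "petal_length", "petal_width")

  // A few Iris rows as a stand-in for the real dataset (illustration only).
  def sampleIris(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
      (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
      (4.7, 3.2, 1.3, 0.2, "Iris-setosa"),
      (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
      (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
      (6.9, 3.1, 4.9, 1.5, "Iris-versicolor"),
      (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
      (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
      (7.1, 3.0, 5.9, 2.1, "Iris-virginica")
    ).toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
  }

  // Fits a pipeline on df and returns df with prediction columns appended.
  def withPredictions(df: DataFrame): DataFrame = {
    // String class name -> numeric "label" column.
    val indexer = new StringIndexer()
      .setInputCol("species")
      .setOutputCol("label")
      .fit(df)

    // Numeric feature columns -> single "features" vector column.
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    // Uses the default "label" and "features" columns.
    val lr = new LogisticRegression()
      .setMaxIter(50)
      .setRegParam(0.01)

    // Numeric prediction -> original string label.
    // (On Spark 3.x, labelsArray(0) is the non-deprecated equivalent of labels.)
    val converter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predicted_species")
      .setLabels(indexer.labels)

    val model = new Pipeline()
      .setStages(Array(indexer, assembler, lr, converter))
      .fit(df)

    // transform appends rawPrediction, probability, prediction and predicted_species.
    model.transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IrisPredictions")
      .master("local[*]")
      .getOrCreate()

    val predictions = withPredictions(sampleIris(spark))

    // Original columns next to the predicted class and per-class probabilities.
    predictions
      .select("sepal_length", "sepal_width", "petal_length", "petal_width",
        "species", "predicted_species", "probability")
      .show(false)

    spark.stop()
  }
}
```

The probability column holds one value per class, in label-index order, so indexer.labels also tells you which entry belongs to which species. For a real evaluation you would fit on a training split and call transform on held-out data rather than on the rows used for fitting, as this sketch does for brevity.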
