简体   繁体   English

如何在Spark / Scala中显示预测,标签和数据框列?

[英]How do I display predictions, labels and dataframe column in Spark/Scala?

I am performing a naive Bayes classification in Spark/Scala. 我在Spark / Scala中执行一个朴素的贝叶斯分类。 It seems to work OK, the code is: 它似乎工作正常,代码是:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer

val dfLemma2 = dfLemma.withColumn("racist", 'racist.cast("String"))

val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")
val indexed = indexer.fit(dfLemma2).transform(dfLemma2)
indexed.show()


val hashingTF = new HashingTF()
  .setInputCol("lemma").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(indexed)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "indexracist").take(3).foreach(println)
val changedTypedf = rescaledData.withColumn("indexracist", 'indexracist.cast("double"))
changedTypedf.show()

// val labeled = changedTypedf.map(row => LabeledPoint(row(0), row.getAs[Vector](4)))

val labeled = changedTypedf.select("indexracist","features").rdd.map(row => LabeledPoint(
   row.getAs[Double]("indexracist"),
   org.apache.spark.mllib.linalg.Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.SparseVector]("features"))
))


import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.util.MLUtils
    // Split data into training (60%) and test (40%).
    val Array(training, test) = labeled.randomSplit(Array(0.6, 0.4))

    val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))    

predictionAndLabel.take(100)

This outputs: 这输出:

res330: Array[(Double, Double)] = Array((0.0,0.0), (0.0,0.0), (0.0,0.0), (0.0,0.0),

which I assume is an array of (prediction, label) pairs. 我假设是一组(预测,标签)对。 What I would like to output is these pairs joined to the original text, which was a column called lemma in training dataframe, so something like: 我想输出的是这些对连接到原始文本,这是一个名为引理的训练数据帧,所以类似于:

--------------------------------------------------
| Prediction | Label      | lemma                |
--------------------------------------------------
|    0.0     |    0.0     |[cakes, are, good]    |
|    0.0     |    0.0     |[jim, says, hi]       |
|    1.0     |    1.0     |[shut, the, dam, door]|
...
--------------------------------------------------

Any pointers are appreciated as my Spark/Scala is weak. 由于我的Spark / Scala很弱,因此任何指针都会受到赞赏。

EDIT, The text column is called 'lemma' in 'indexed': 编辑,文本列在'索引'中称为'引理':

+------+-------------------------------------------------------------------------------------------------------------------+
|racist|lemma                                                                                                              |
+------+-------------------------------------------------------------------------------------------------------------------+
|true  |[@cllrwood, abbo, @ukip, britainfirst]                                                                             |
|false |[objectofthemonth, george, lansbury, bust, jussuf, abbo, amp, fascinating, insight, son, jerome]                   |
|false |[nowplay, one, night, stand, van, brave, @bbraveofficial, bbravesquad, abbo, safe]                                 |
|false |[@mahesh, weet, son, satyamurthy, kante, abbo, chana, better, aaamovie]                                            |

You just need to transform your data and show them as followed : 您只需要转换数据并按以下方式显示它们:

val predictions = model.transform(test)
predictions.show()

Try using ml package instead of mllib package. 尝试使用ml包而不是mllib包。 for example refer https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala 例如参考https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.sql.SparkSession

object NaiveBayesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("NaiveBayesExample")
      .getOrCreate()


    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Split the data into training and test sets (30% held out for testing)
    val Array(trainingData, testData) = data.randomSplit(Array(0.6, 0.4))

    // Train a NaiveBayes model.
    val model = new NaiveBayes()
      .fit(trainingData)

    // Select example rows to display.
    val predictions = model.transform(testData)
    predictions.show()

    // Select (prediction, true label) and compute test error
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test set accuracy = " + accuracy)


    spark.stop()
  }
}

As some answers said : It's recommended that you use ml package instead of mllib package since spark 2.0 正如一些答案所说:建议你使用ml包而不是mllib包自spark 2.0

After rewriting your code using ml package, the answer of your question will be very straightforward : just selecting the right columns should answer your needs 使用ml包重写代码后,问题的答案将非常简单:只需选择正确的列就可以满足您的需求

import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}

val dfLemma2 = dfLemma.withColumn("racist", 'racist.cast("String"))

val indexer = new StringIndexer().setInputCol("racist").setOutputCol("indexracist")

val hashingTF = new HashingTF()
  .setInputCol("lemma")
  .setOutputCol("rawFeatures")
  .setNumFeatures(20)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val naiveBayes =
  new NaiveBayes().setLabelCol("indexracist").setFeaturesCol("features").setModelType("multinomial").setSmoothing(1.0)

val pipeline = new Pipeline().setStages(Array(indexer, hashingTF, idf, naiveBayes))

val Array(training, test) = dfLemma2.randomSplit(Array(0.6, 0.4))

val model = pipeline.fit(training)

val predictionAndLabel = model.transform(test).select('Prediction, 'racist, 'indexracist, 'lemma)

predictionAndLabel.take(100)

Hope it helps, otherwise comment your problems 希望它有所帮助,否则评论你的问题

While selecting the output columns to be displayed, try to include the "lemma" column also so that it will get written along with label and features columns. 在选择要显示的输出列时,请尝试包含“引理”列,以便它与标签和要素列一起写入。

For more details please refer How to create correct data frame for classification in Spark ML . 有关更多详细信息,请参阅如何在Spark ML中创建用于分类的正确数据框 This post is little bit similar to your question and check whether it helps 这篇文章与你的问题有点相似,并检查它是否有帮助

我们必须使用管道来获取除预测列之外的训练列

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM