How to get accuracy, recall and ROC with cross validation in Spark ml lib?

Question

I am using Spark 2.0.2, along with the "ml" library for machine learning with Datasets. I want to run algorithms with cross validation and extract the following metrics: accuracy, precision, recall, ROC, and the confusion matrix. My data labels are binary.

With the MulticlassClassificationEvaluator I can only get the accuracy of the algorithm, by accessing "avgMetrics". With the BinaryClassificationEvaluator I can get the area under ROC. But I cannot use both evaluators at once. Is there a way to extract all of the metrics I want?

Answer 1

Try using MLlib to evaluate your results.

I converted the Dataset to an RDD, then used MulticlassMetrics from MLlib.

You can see a demo here: Spark DecisionTreeExample.scala

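import org.apache.spark.ml.Transformer
import org.apache.spark.ml.util.MetadataUtils
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.DataFrame

// Excerpt from Spark's DecisionTreeExample. The private[ml] modifier and
// MetadataUtils (internal to Spark) mean it compiles only inside Spark's own packages.
private[ml] def evaluateClassificationModel(
      model: Transformer,
      data: DataFrame,
      labelColName: String): Unit = {
    // Score the data once and cache it, since it is read twice below.
    val fullPredictions = model.transform(data).cache()
    val predictions = fullPredictions.select("prediction").rdd.map(_.getDouble(0))
    val labels = fullPredictions.select(labelColName).rdd.map(_.getDouble(0))
    // Print number of classes for reference.
    val numClasses = MetadataUtils.getNumClasses(fullPredictions.schema(labelColName)) match {
      case Some(n) => n
      case None => throw new RuntimeException(
        "Unknown failure when indexing labels for classification.")
    }
    // MulticlassMetrics expects an RDD of (prediction, label) pairs.
    val accuracy = new MulticlassMetrics(predictions.zip(labels)).accuracy
    println(s"  Accuracy ($numClasses classes): $accuracy")
  }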
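The same idea extends beyond accuracy. Here is a minimal sketch, assuming a predictions DataFrame with a "prediction" column and a Double label column, and assuming the positive class is encoded as 1.0. The names BinaryLabelMetricsSketch, Summary, and summarize are made up for illustration; the MulticlassMetrics calls (accuracy, precision, recall, fMeasure, confusionMatrix) are from MLlib.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.sql.{DataFrame, SparkSession}

object BinaryLabelMetricsSketch {
  // Hypothetical container for the metrics asked about in the question.
  case class Summary(
      accuracy: Double,
      precision: Double,
      recall: Double,
      f1: Double,
      confusion: Matrix)

  // Assumes both columns hold Doubles and `positive` marks the positive class.
  def summarize(predictions: DataFrame, labelCol: String, positive: Double = 1.0): Summary = {
    val predictionAndLabels = predictions
      .select("prediction", labelCol)
      .rdd
      .map(row => (row.getDouble(0), row.getDouble(1)))
    val m = new MulticlassMetrics(predictionAndLabels)
    // confusionMatrix has actual classes in rows and predicted classes in columns,
    // both ordered by ascending class label.
    Summary(m.accuracy, m.precision(positive), m.recall(positive),
      m.fMeasure(positive), m.confusionMatrix)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("binary-label-metrics-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy (prediction, label) rows standing in for model.transform(data).
    val predictions = Seq(
      (1.0, 1.0), (1.0, 1.0), (0.0, 1.0),
      (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)
    ).toDF("prediction", "label")

    val s = summarize(predictions, "label")
    println(s"accuracy=${s.accuracy} precision=${s.precision} recall=${s.recall} f1=${s.f1}")
    println(s"confusion matrix:\n${s.confusion}")
    spark.stop()
  }
}
```

With a fitted CrossValidatorModel, the input would be something like cvModel.transform(testData) rather than the toy rows used here.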

Answer 2

You can follow the official Evaluation Metrics guide provided by Apache Spark. The guide covers the available evaluation metrics, including:

Precision (Positive Predictive Value), Recall (True Positive Rate), F-measure, Receiver Operating Characteristic (ROC), Area Under ROC Curve, Area Under Precision-Recall Curve.

Here is the link: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
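To connect that guide to cross validation, here is a rough sketch rather than a definitive recipe: fit a CrossValidator with a BinaryClassificationEvaluator, then score a held-out split and pass (score, label) pairs to MLlib's BinaryClassificationMetrics. The toy data, the LogisticRegression estimator, the parameter grid, and the names CrossValidatedRocSketch and run are all assumptions made for illustration. The metrics are computed on the held-out split, not per fold.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.SparkSession

object CrossValidatedRocSketch {
  // Returns the per-grid-point average CV metric and held-out binary metrics.
  def run(spark: SparkSession): (Array[Double], BinaryClassificationMetrics) = {
    // Hypothetical toy data: one noisy feature centred on the label value.
    val rng = new scala.util.Random(42)
    val rows = (0 until 400).map { i =>
      val label = (i % 2).toDouble
      (label, Vectors.dense(label + rng.nextGaussian()))
    }
    val data = spark.createDataFrame(rows).toDF("label", "features")
    val Array(train, test) = data.randomSplit(Array(0.75, 0.25), seed = 7L)

    val lr = new LogisticRegression()
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator()) // default metric: areaUnderROC
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
    val cvModel = cv.fit(train)

    // Use the probability of class 1 as the score for each held-out row.
    val scoreAndLabels = cvModel.transform(test)
      .select("probability", "label")
      .rdd
      .map(row => (row.getAs[Vector](0)(1), row.getDouble(1)))

    (cvModel.avgMetrics, new BinaryClassificationMetrics(scoreAndLabels))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("cv-roc-sketch")
      .getOrCreate()
    val (avgMetrics, metrics) = run(spark)
    println(s"CV avgMetrics (areaUnderROC per grid point): ${avgMetrics.mkString(", ")}")
    println(s"Held-out area under ROC: ${metrics.areaUnderROC()}")
    println(s"Held-out area under PR:  ${metrics.areaUnderPR()}")
    spark.stop()
  }
}
```

Calling metrics.roc() returns the curve itself as an RDD of (false positive rate, true positive rate) points. The "prediction" column of the same transformed DataFrame can be paired with the label and passed to MulticlassMetrics for accuracy, per-class precision and recall, and the confusion matrix.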

Answer 1
2 2018-01-23 02:52:19

Answer 2
1 2017-01-18 09:24:54
