How to calculate binary classification metrics in Spark MLlib using the DataFrame API

Question

I am using Spark MLlib with the DataFrame API. Given the following sample code:

import org.apache.spark.ml.classification.DecisionTreeClassifier

val dtc = new DecisionTreeClassifier()
val testResults = dtc.fit(training).transform(test)

Can I calculate the model quality metrics over testResults using the DataFrame API?

If not, how do I correctly transform testResults (containing "label", "features", "rawPrediction", "probability", "prediction") so that I can use BinaryClassificationMetrics (RDD API)?

NOTE: I am interested in the "byThreshold" metrics as well.

Answer 1

If you look at the constructor of BinaryClassificationMetrics, it takes an RDD[(Double, Double)] of (score, label) pairs. You can convert the DataFrame to the right format like this:

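import org.apache.spark.ml.linalg.Vector  // Spark 2.x+; on 1.x use org.apache.spark.mllib.linalg.Vector

// Build (score, label) pairs: the score is the probability of the positive class (index 1).
val scoreAndLabels = testResults.select("label", "probability")
    .rdd
    .map(row =>
            (row.getAs[Vector]("probability")(1), row.getAs[Double]("label"))
    )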
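
Once the pairs are in that shape, the "byThreshold" metrics from the question can be read off BinaryClassificationMetrics. The following is a minimal sketch rather than a definitive implementation: the helper name printThresholdMetrics is hypothetical, and it assumes Spark 2.x or later, where the "probability" column holds an org.apache.spark.ml.linalg.Vector.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.DataFrame

// Hypothetical helper: prints aggregate and per-threshold metrics
// for a DataFrame produced by model.transform(test).
def printThresholdMetrics(testResults: DataFrame): Unit = {
  // (score, label) pairs; the score is the probability of class 1.
  val scoreAndLabels = testResults.select("label", "probability").rdd.map { row =>
    (row.getAs[Vector]("probability")(1), row.getAs[Double]("label"))
  }

  val metrics = new BinaryClassificationMetrics(scoreAndLabels)

  // Aggregate metrics.
  println(s"Area under ROC = ${metrics.areaUnderROC()}")
  println(s"Area under PR  = ${metrics.areaUnderPR()}")

  // Per-threshold metrics: each RDD element is (threshold, value).
  metrics.precisionByThreshold().collect().foreach { case (t, p) =>
    println(s"threshold = $t, precision = $p")
  }
  metrics.recallByThreshold().collect().foreach { case (t, r) =>
    println(s"threshold = $t, recall = $r")
  }
}
```

Calling printThresholdMetrics(testResults) collects the per-threshold values to the driver. For a large test set, the second constructor argument (numBins) can down-sample the number of thresholds.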

EDIT:

The probability is stored in a Vector whose length equals the number of classes being predicted. For binary classification, the first element corresponds to label = 0 and the second to label = 1; pick the element for your positive label (normally label = 1).
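
As for the first half of the question, the DataFrame-based API does ship an evaluator that works on the transformed DataFrame directly, although it only reports the aggregate metrics (areaUnderROC and areaUnderPR), not the "byThreshold" ones. A minimal sketch, assuming the default column names shown in the question; the helper name aucFromDataFrame is hypothetical:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.DataFrame

// Hypothetical helper: area under the ROC curve computed without leaving the DataFrame API.
def aucFromDataFrame(testResults: DataFrame): Double = {
  val evaluator = new BinaryClassificationEvaluator()
    .setLabelCol("label")                 // default, shown for clarity
    .setRawPredictionCol("rawPrediction") // default, shown for clarity
    .setMetricName("areaUnderROC")        // or "areaUnderPR"
  evaluator.evaluate(testResults)
}
```

For precision, recall, or F-measure at each threshold, the RDD conversion described above is still needed.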
