How to calculate Binary Classification Metrics in Spark MLlib with the DataFrame API
I am using Spark MLlib with the DataFrame API. Given the following sample code:
val dtc = new DecisionTreeClassifier()
val testResults = dtc.fit(training).transform(test)
Can I calculate the model quality metrics over testResults using the DataFrame API?
If not, how do I correctly transform testResults (containing "label", "features", "rawPrediction", "probability", "prediction") so that I can use BinaryClassificationMetrics (RDD API)?
NOTE: I am interested in the "byThreshold" metrics as well.
If you look at the constructor of BinaryClassificationMetrics, it takes an RDD[(Double, Double)] of (score, label) pairs. You can convert the DataFrame to the right format like this:
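// Spark 2.x+ stores ML vectors as org.apache.spark.ml.linalg.Vector;
// on Spark 1.x import org.apache.spark.mllib.linalg.Vector instead.
import org.apache.spark.ml.linalg.Vector
val scoreAndLabels = testResults.select("label", "probability")
  .rdd
  .map(row =>
    // (probability of the positive class, true label)
    (row.getAs[Vector]("probability")(1), row.getAs[Double]("label"))
  )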
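Once the pairs are in that shape, the "byThreshold" metrics from the question come from the same BinaryClassificationMetrics object. The following is a minimal sketch, not a definitive implementation: the object name ThresholdMetricsSketch, the compute helper, and the four hand-written (score, label) pairs are assumptions made for illustration and stand in for the scoreAndLabels RDD built from testResults.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.SparkSession

object ThresholdMetricsSketch {
  // Returns (area under ROC, precision per threshold, recall per threshold).
  def compute(spark: SparkSession): (Double, Array[(Double, Double)], Array[(Double, Double)]) = {
    // Hypothetical data: (score for the positive class, true label).
    // In the real job this would be the scoreAndLabels RDD derived from testResults.
    val scoreAndLabels = spark.sparkContext.parallelize(Seq(
      (0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)
    ))

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)

    // Each element is a (threshold, metric value) pair.
    val precision = metrics.precisionByThreshold().collect()
    val recall = metrics.recallByThreshold().collect()

    (metrics.areaUnderROC(), precision, recall)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would reuse its existing session.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("threshold-metrics-sketch")
      .getOrCreate()
    try {
      val (auc, precision, recall) = compute(spark)
      println(s"areaUnderROC = $auc")
      precision.foreach { case (t, p) => println(s"threshold $t precision $p") }
      recall.foreach { case (t, r) => println(s"threshold $t recall $r") }
    } finally {
      spark.stop()
    }
  }
}
```

The same object also exposes fMeasureByThreshold(), roc() and areaUnderPR(). On large test sets, the optional numBins constructor argument can be used to down-sample the number of thresholds.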
EDIT:
Probability is stored in a Vector whose length equals the number of classes you'd like to predict. In the case of binary classification, the first element corresponds to label = 0 and the second to label = 1; you should pick the element for your positive label (normally label = 1).
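As for the first part of the question (staying entirely in the DataFrame API): if a single summary number is enough, BinaryClassificationEvaluator from the spark.ml package can be applied to the transformed DataFrame directly. It supports only areaUnderROC and areaUnderPR, so the "byThreshold" metrics still require the RDD conversion. A minimal sketch follows, in which the object name EvaluatorSketch and the four hand-written rows are assumptions standing in for testResults:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object EvaluatorSketch {
  def areaUnderROC(spark: SparkSession): Double = {
    // Hypothetical stand-in for testResults, reduced to the two columns
    // the evaluator reads. Index 0 of rawPrediction belongs to label = 0,
    // index 1 to label = 1.
    val predictions = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(-2.0, 2.0)),
      (1.0, Vectors.dense(-1.0, 1.0)),
      (0.0, Vectors.dense(1.0, -1.0)),
      (0.0, Vectors.dense(3.0, -3.0))
    )).toDF("label", "rawPrediction")

    new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC") // "areaUnderPR" is the other supported value
      .evaluate(predictions)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("evaluator-sketch")
      .getOrCreate()
    try {
      println(s"areaUnderROC = ${areaUnderROC(spark)}")
    } finally {
      spark.stop()
    }
  }
}
```

With the real pipeline output, the call would presumably reduce to evaluating testResults itself, since it already contains the "label" and "rawPrediction" columns.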