How does sklearn calculate the area under the ROC curve for a binary classifier?

Question

This might look like a duplicate of another question that has been asked here. However, I've looked at the answer there and still cannot understand how scikit-learn calculates the area under the ROC curve by testing only one threshold, namely the one implied by:

y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)

Why doesn't it take multiple values (multiple y_test, y_pred pairs resulting from multiple thresholds)? Any simplified explanation would be really appreciated.

Answer 1

The second argument to roc_auc_score() in this case should be the predicted probability obtained from clf.predict_proba(X_test) (for a binary problem, the column for the positive class, e.g. clf.predict_proba(X_test)[:, 1]), not the hard labels from clf.predict(X_test). The different thresholds are derived inside the function from these predicted probabilities. There is an example of this in the documentation:

import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)  # 0.75
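To connect this back to the call in the question, here is a minimal sketch comparing hard labels with probabilities. The synthetic dataset, the LogisticRegression model, and all variable names are assumptions made for illustration; they are not from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical data and model, chosen only to have something to score.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard 0/1 labels: the "scores" take only two values, so the ROC curve
# has a single interior point between (0, 0) and (1, 1).
y_pred = clf.predict(X_test)
auc_from_labels = roc_auc_score(y_test, y_pred)

# Probability of the positive class: many distinct values, hence many thresholds.
y_prob = clf.predict_proba(X_test)[:, 1]
auc_from_probs = roc_auc_score(y_test, y_prob)

print(auc_from_labels, auc_from_probs)

# With hard labels the area reduces to (TPR + TNR) / 2, i.e. balanced accuracy.
print(balanced_accuracy_score(y_test, y_pred))
```

So roc_auc_score(y_test, y_pred) does not fail; it simply summarizes a curve built from the one operating point that the hard labels define.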

To understand how roc_auc_score is calculated, it might be helpful to look at the ROC curve itself. This can be done with the function sklearn.metrics.roc_curve(). Example taken from the documentation:

import numpy as np
from sklearn import metrics
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
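# Output as shown in the documentation quoted here; newer scikit-learn
# releases prepend an extra (0, 0) starting point to these arrays.
fpr         # array([0. , 0.5, 0.5, 1. ])
tpr         # array([0.5, 0.5, 1. , 1. ])
thresholds  # array([0.8 , 0.4 , 0.35, 0.1 ])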

(Even though y is different in the latter example, it is still a binary classification, with 2 being the positive class.)

As can be seen in the latter example, the different thresholds are taken from the supplied scores.
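The idea of taking thresholds from the scores can be sketched by hand with NumPy alone. This is not scikit-learn's actual implementation, only an illustration under the assumption that a sample is predicted positive when its score is greater than or equal to the threshold; the helper name roc_points is hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """Use each distinct score as a threshold, highest first (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)[::-1]
    n_pos = y_true.sum()
    n_neg = (~y_true).sum()
    fpr, tpr = [0.0], [0.0]  # starting point: nothing is predicted positive
    for t in thresholds:
        predicted_pos = scores >= t
        tpr.append((predicted_pos & y_true).sum() / n_pos)
        fpr.append((predicted_pos & ~y_true).sum() / n_neg)
    return np.array(fpr), np.array(tpr), thresholds

# Same data as the documentation example, with 2 as the positive class.
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_points(y == 2, scores)

# Trapezoidal rule over the swept points gives the area under the curve.
area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(thresholds.tolist())  # [0.8, 0.4, 0.35, 0.1]
print(area)                 # 0.75
```

Each threshold yields one (fpr, tpr) pair, which is why continuous scores produce a curve while hard labels produce only a single point.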

The ROC curve is generated by plotting the true positive rate tpr on the y-axis against the false positive rate fpr on the x-axis.
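Since the curve is just tpr plotted against fpr, its area can be checked against roc_auc_score directly. A short sketch reusing the documentation data; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve, with 2 as the positive class.
fpr, tpr, _ = metrics.roc_curve(y, scores, pos_label=2)

# Area under those points via the trapezoidal rule.
area_under_curve = metrics.auc(fpr, tpr)

# The same quantity computed in one call, with labels mapped to 0/1.
direct = metrics.roc_auc_score((y == 2).astype(int), scores)

print(area_under_curve, direct)  # both are 0.75 for this data
```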

Solution 1
0 2017-12-10 20:43:08
