
How does sklearn calculate the area under the ROC curve for a binary classifier?

This might appear to be a duplicate of another question that has been asked here. However, I've looked at the answer there and still cannot understand how scikit-learn calculates the area under the ROC curve by testing only one threshold, namely the one implied by:

y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)

Why doesn't it take multiple values (multiple y_test, y_pred pairs resulting from multiple thresholds)? Any simplified explanation would be really appreciated.

The second argument to roc_auc_score() in this case should be the predicted probability obtained from clf.predict_proba(X_test); for a binary classifier, that means the column for the positive class, clf.predict_proba(X_test)[:, 1]. The different thresholds are derived inside this function from these predicted probabilities. There is an example of this in the documentation:

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
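As for what happens when hard labels from clf.predict() are passed instead: the labels act as scores that take only two values, so the curve has a single interior point and the area reduces to (TPR + TNR) / 2, which is the balanced accuracy at that one threshold. The following is a minimal sketch of that comparison; the synthetic dataset, the LogisticRegression model, and all variable names are assumptions made for illustration and are not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and model, chosen only for illustration (assumptions).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard 0/1 labels: one threshold only
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_test, y_pred)
auc_from_scores = roc_auc_score(y_test, y_score)
bal_acc = balanced_accuracy_score(y_test, y_pred)

print("AUC from hard labels:   ", auc_from_labels)
print("Balanced accuracy:      ", bal_acc)
print("AUC from probabilities: ", auc_from_scores)
```

The first two printed values should agree, while the third uses the full ranking of the test samples and is the number usually meant by "ROC AUC".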

To understand how roc_auc_score is calculated, it might be helpful to look at the ROC curve itself. This can be done with the function sklearn.metrics.roc_curve(). Example taken from the documentation:

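>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
>>> # Output as quoted in the original answer; recent scikit-learn versions
>>> # also prepend a (0, 0) starting point with an infinite first threshold.
>>> fpr
array([ 0. ,  0.5,  0.5,  1. ])
>>> tpr
array([ 0.5,  0.5,  1. ,  1. ])
>>> thresholds
array([ 0.8 ,  0.4 ,  0.35,  0.1 ])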

(Even though y is different in the latter example, it is still a binary classification, with 2 being the positive class.)

As can be seen in the latter example, the different thresholds are taken from the supplied scores.
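To make the threshold sweep concrete, here is a rough sketch that recomputes the same rates by hand: each distinct score is used as a threshold, every sample scoring at or above it is predicted positive, and the false and true positive rates are counted. The helper name sweep_thresholds is hypothetical, and this is not a claim about how scikit-learn is implemented internally, only a check of the idea.

```python
import numpy as np

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
is_pos = (y == 2)  # 2 is the positive class, as in the example above


def sweep_thresholds(is_pos, scores):
    """Return (threshold, fpr, tpr) for each distinct score, highest first."""
    n_pos = int(np.sum(is_pos))
    n_neg = int(np.sum(~is_pos))
    rows = []
    for t in np.sort(np.unique(scores))[::-1]:
        predicted_pos = scores >= t                 # classify at this threshold
        tp = int(np.sum(predicted_pos & is_pos))    # true positives
        fp = int(np.sum(predicted_pos & ~is_pos))   # false positives
        rows.append((float(t), fp / n_neg, tp / n_pos))
    return rows


rows = sweep_thresholds(is_pos, scores)
for t, fpr_t, tpr_t in rows:
    print(f"threshold={t:.2f}  fpr={fpr_t:.1f}  tpr={tpr_t:.1f}")
```

The four rows reproduce the fpr, tpr and thresholds arrays quoted above.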

The ROC curve is generated by plotting the true positive rate tpr on the y-axis against the false positive rate fpr on the x-axis. roc_auc_score returns the area under that curve, computed with the trapezoidal rule.
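The link between the curve and the score can be checked numerically. The sketch below takes the (fpr, tpr) points from roc_curve for the same toy data and computes the area three ways: with the trapezoidal rule written out by hand, with sklearn.metrics.auc, and with roc_auc_score directly. The variable names are illustrative only.

```python
import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = metrics.roc_curve(y, scores, pos_label=2)

# Trapezoidal rule by hand: width of each step times the mean height.
area_manual = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# The same area via scikit-learn's generic helper.
area_auc = float(metrics.auc(fpr, tpr))

# roc_auc_score computed directly from 0/1 labels and scores.
area_direct = float(metrics.roc_auc_score((y == 2).astype(int), scores))

print(area_manual, area_auc, area_direct)
```

All three values come out to 0.75, matching the documentation example quoted earlier.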


 