How does sklearn calculate the area under the ROC curve for a binary classifier?
This might look like a duplicate of another question that has been asked here. However, I've looked at the answer there and still cannot understand how scikit-learn calculates the area under the ROC curve by testing only one threshold, namely the one implied by:
y_pred = clf.predict(X_test)
roc_auc_score(y_test, y_pred)
Why doesn't it take multiple values (multiple y_test, y_pred pairs resulting from multiple thresholds)? Any simplified explanation would be really appreciated.
The second argument to roc_auc_score() in this case should be the predicted probabilities obtained from clf.predict_proba(X_test) (for a binary classifier, the column for the positive class, i.e. clf.predict_proba(X_test)[:, 1]), not the hard labels returned by clf.predict(X_test). The different thresholds are derived inside the function from these predicted probabilities. There is an example of this in the documentation:
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
roc_auc_score(y_true, y_scores)
0.75
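To make the difference between hard labels and scores concrete, here is a minimal sketch. The arrays are made-up illustration data, not from the question: y_scores stands in for clf.predict_proba(X_test)[:, 1] and y_labels for clf.predict(X_test), assuming the classifier cuts at 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical data for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1])
# Stand-in for clf.predict_proba(X_test)[:, 1]
y_scores = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])
# Stand-in for clf.predict(X_test), assuming a 0.5 cut-off
y_labels = (y_scores >= 0.5).astype(int)

# With continuous scores, every distinct score can act as a threshold.
auc_from_scores = roc_auc_score(y_true, y_scores)
# With hard labels there are only two distinct values, so the "curve"
# has a single point between (0, 0) and (1, 1).
auc_from_labels = roc_auc_score(y_true, y_labels)

print(auc_from_scores)  # about 0.889
print(auc_from_labels)  # about 0.667
```

Passing hard labels does not raise an error; it simply yields the area under a curve built from one operating point, which is why it can look as if only one threshold was tested.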
To understand how roc_auc_score is calculated, it might be helpful to look at the ROC curve itself. This can be done with the function sklearn.metrics.roc_curve(). Example taken from the documentation:
import numpy as np
from sklearn import metrics
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
fpr
array([ 0. , 0.5, 0.5, 1. ])
tpr
array([ 0.5, 0.5, 1. , 1. ])
thresholds
array([ 0.8 , 0.4 , 0.35, 0.1 ])
(Even though y is different in the latter example, it is still a binary classification, with 2 being the positive class.)
As can be seen in the latter example, the different thresholds are taken from the supplied scores.
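The threshold sweep can be reproduced by hand. The following is a simplified sketch in plain Python, not scikit-learn's actual implementation; roc_points is a hypothetical helper name. It uses each distinct score as a cut-off and counts true and false positives at that cut-off.

```python
def roc_points(y_true, scores, pos_label):
    """Return (fpr, tpr, thresholds), using each distinct score as a cut-off."""
    n_pos = sum(1 for y in y_true if y == pos_label)
    n_neg = len(y_true) - n_pos
    # Sweep from the strictest threshold to the most lenient one.
    thresholds = sorted(set(scores), reverse=True)
    fpr, tpr = [], []
    for thr in thresholds:
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == pos_label)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y != pos_label)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr, thresholds


y = [1, 1, 2, 2]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_points(y, scores, pos_label=2)
print(fpr)         # [0.0, 0.5, 0.5, 1.0]
print(tpr)         # [0.5, 0.5, 1.0, 1.0]
print(thresholds)  # [0.8, 0.4, 0.35, 0.1]
```

This matches the arrays shown in the example. Depending on the scikit-learn version, roc_curve may additionally prepend a starting point at (0, 0) with an extra threshold, so its output can be one element longer.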
The ROC curve is generated by plotting the true positive rate tpr on the y-axis against the false positive rate fpr on the x-axis. The area under this curve is the value that roc_auc_score returns.
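Given the points of the curve, the area can be obtained with the trapezoidal rule. Below is a minimal sketch in plain Python using the fpr and tpr values from the roc_curve example, with the starting point (0, 0) added by hand; trapezoid_area is a hypothetical helper, and sklearn.metrics.auc offers the same kind of computation.

```python
def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    area = 0.0
    for i in range(1, len(x)):
        # Width of the segment times the average height of its two ends.
        area += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return area


# Points from the roc_curve example, with (0, 0) prepended.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]

auc = trapezoid_area(fpr, tpr)
print(auc)  # 0.75
```

The result agrees with the 0.75 reported by roc_auc_score for the same scores in the first example.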