
How does sklearn calculate the area under the ROC curve for two binary inputs?

I noticed that sklearn has the following function:

sklearn.metrics.roc_auc_score()

which takes ground_truth and prediction as input.

For example,

ground_truth = [1,1,0,0,0]
prediction = [1,1,0,0,0]

sklearn.metrics.roc_auc_score(ground_truth, prediction) returns 1.

My problem is that I can't figure out how sklearn calculates the area under the ROC curve with two binary inputs. Isn't the ROC curve derived by moving the class assignment threshold and calculating the false alarm rate and hit rate for each threshold? With two binary inputs, shouldn't you only have one (false alarm rate, hit rate) measurement?

Many thanks!

You're correct that with binary predictions you'll only have a single threshold/measurement for the curve. I didn't understand it myself, so I ran the code with a ton of print statements, first for the sklearn tutorial and then with a purely binary example. All the magic is happening in sklearn.metrics._binary_clf_curve.

The "thresholds" are the distinct prediction scores. For any binary classifier that outputs purely ones and zeros you're going to get two thresholds, 1 and 0 (they're sorted internally from highest to lowest). At the 1 threshold, a prediction score >= 1 counts as positive and anything below that (only 0 in this case) counts as negative, and the TP and FP rates are calculated from that. In all cases, the last threshold categorizes everything as positive, so the TP and FP rates will both be 1. The curve therefore has one real measurement in the middle, joined by straight lines to (0, 0) and (1, 1), and the area under those line segments is what gets reported.
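To make the threshold sweep above concrete, here is a minimal sketch that recomputes the curve by hand and compares it with sklearn's public `roc_curve` and `roc_auc_score`. The prediction vector is a made-up, imperfect one (not the one from the question) so that the middle point is not trivially (0, 1), and names such as `y_true`, `y_score`, and `manual_auc` are illustrative only. It assumes the curve starts at (0, 0) and that the area is taken with the trapezoidal rule.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical binary labels and binary "scores" (hard 0/1 predictions).
y_true = np.array([1, 1, 0, 0, 0])
y_score = np.array([1, 0, 1, 0, 0])  # one hit, one miss, one false alarm

n_pos = np.sum(y_true == 1)
n_neg = np.sum(y_true == 0)

# Distinct scores act as thresholds, sorted from highest to lowest: [1, 0].
thresholds = np.unique(y_score)[::-1]

# Start the curve at (FPR, TPR) = (0, 0), then add one point per threshold.
points = [(0.0, 0.0)]
for t in thresholds:
    predicted_positive = y_score >= t
    tpr = np.sum(predicted_positive & (y_true == 1)) / n_pos
    fpr = np.sum(predicted_positive & (y_true == 0)) / n_neg
    points.append((float(fpr), float(tpr)))

fpr_manual, tpr_manual = map(np.array, zip(*points))

# Trapezoidal rule: segment width times the mean height of its endpoints.
manual_auc = float(
    np.sum(np.diff(fpr_manual) * (tpr_manual[:-1] + tpr_manual[1:]) / 2)
)

# sklearn's public API, for comparison.
fpr_sk, tpr_sk, _ = roc_curve(y_true, y_score)
sk_auc = roc_auc_score(y_true, y_score)

print("manual points :", points)
print("sklearn points:", list(zip(fpr_sk, tpr_sk)))
print("manual AUC    :", manual_auc)
print("sklearn AUC   :", sk_auc)
```

With only one interior point, the area reduces to the average of the hit rate and the correct-rejection rate, (TPR + (1 - FPR)) / 2, which is why a binary prediction still yields a well-defined number.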

It appears then that to generate a correct ROC curve for a sklearn classifier you'd use clf.predict_proba() rather than predict(). Or maybe predict_log_proba()? I'm not sure if it would make any difference.
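As a rough illustration of that point, the sketch below scores one classifier three ways: with hard labels from `predict()`, with positive-class probabilities from `predict_proba()`, and with `predict_log_proba()`. The synthetic dataset, the choice of `LogisticRegression`, and all variable names are assumptions made for the example, not part of the original question. Since the AUC depends only on how the scores rank the samples, and the logarithm preserves that ranking, the last two should be expected to agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Three candidate inputs for roc_auc_score.
hard_labels = clf.predict(X_test)                 # only 0s and 1s
proba = clf.predict_proba(X_test)[:, 1]           # P(class == 1)
log_proba = clf.predict_log_proba(X_test)[:, 1]   # log P(class == 1)

auc_hard = roc_auc_score(y_test, hard_labels)
auc_proba = roc_auc_score(y_test, proba)
auc_log = roc_auc_score(y_test, log_proba)

# Count how many points each ROC curve is built from.
n_points_hard = len(roc_curve(y_test, hard_labels)[0])
n_points_proba = len(roc_curve(y_test, proba)[0])

print("AUC from predict()          :", auc_hard, "with", n_points_hard, "points")
print("AUC from predict_proba()    :", auc_proba, "with", n_points_proba, "points")
print("AUC from predict_log_proba():", auc_log)
```

The hard-label curve has at most three points, while the probability-based curve traces many thresholds, so the two AUC values generally differ.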
