Accuracy for each probability cutoff in a binary classification problem (python sklearn accuracy)
Imagine a binary classification problem. Let's say I have 800,000 predicted probabilities stored in pred_test. I define a cutoff as any value in pred_test such that the values greater than or equal to cutoff are assigned the value 1 and the values smaller than cutoff are assigned the value 0.
Is there a function in sklearn that returns the accuracy of the model for each cutoff in pred_train? I would like to see the accuracy of the model as a function of the cutoff so that I can pick a cutoff systematically.
I tried the following:
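    import numpy as np

    _list = []
    # np.unique already returns the distinct values in sorted order
    for cutoff in np.unique(pred_test):
        binary_prediction = np.where(pred_test >= cutoff, 1, 0)
        # fraction of rows where the thresholded prediction matches the truth
        _list.append((cutoff, (binary_prediction == y_test).sum() / len(pred_test)))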
Here, y_test is the ground truth (an array with the observed outcomes for each of the 800,000 rows). This code returns a list where each element is a tuple containing a cutoff and its corresponding accuracy score.
The object pred_test has around 600,000 distinct values, so I am iterating 600,000 or so times. The above code works, but it takes a very long time to finish. Is there a more efficient way to do this? My bet is that sklearn already has a function that does this.
Here is a similar thread worth checking: Getting the maximum accuracy for a binary probabilistic classifier in scikit-learn
There is no built-in function for that in scikit-learn. I think the reason this is not implemented is the risk of overfitting: you would basically be tuning the cutoff to your training set, and a cutoff chosen that way is risky to rely on for the test set.
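On the efficiency part of the question, the loop can be avoided without any special library support. The following is a minimal sketch in plain NumPy, not a scikit-learn API: it sorts the scores once and uses a cumulative sum to get the accuracy at every distinct cutoff, which costs one sort instead of one full pass per cutoff. The function name accuracy_per_cutoff and the synthetic data are assumptions made for illustration, and the labels are assumed to be 0/1. To address the overfitting concern, the example picks the cutoff on one half of the data and scores it on the other half.

```python
import numpy as np


def accuracy_per_cutoff(y_true, scores):
    """Return (cutoffs, accuracies) for every distinct score used as a cutoff.

    Hypothetical helper: assumes y_true holds 0/1 labels and that a row is
    predicted 1 when its score is >= cutoff, as in the question.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n = len(scores)

    # Sort rows from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    s = scores[order]
    y = y_true[order]

    # Index of the last row in each run of equal scores.
    last = np.r_[np.nonzero(np.diff(s))[0], n - 1]

    tp = np.cumsum(y)[last]   # positives with score >= cutoff
    fp = (last + 1) - tp      # negatives with score >= cutoff
    tn = (n - y.sum()) - fp   # negatives with score < cutoff
    acc = (tp + tn) / n

    # Reverse so cutoffs come out ascending, like np.unique(scores).
    return s[last][::-1], acc[::-1]


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
y_all = rng.integers(0, 2, size=2000)
p_all = np.clip(0.5 * y_all + rng.normal(0.25, 0.25, size=2000), 0, 1)

# Choose the cutoff on one half, then score it on the other half.
y_val, p_val = y_all[:1000], p_all[:1000]
y_hold, p_hold = y_all[1000:], p_all[1000:]

cutoffs, acc = accuracy_per_cutoff(y_val, p_val)
best_cutoff = cutoffs[np.argmax(acc)]
holdout_acc = np.mean((p_hold >= best_cutoff).astype(int) == y_hold)
print(best_cutoff, acc.max(), holdout_acc)
```

With the variables from the question, the call would be accuracy_per_cutoff(y_test, pred_test). The accuracy of the chosen cutoff on the data used to select it tends to be optimistic, which is why the sketch reports a separate hold-out score.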