
Accuracy for each probability cutoff in a binary classification problem (python sklearn accuracy)

Imagine a binary classification problem. Let's say I have 800,000 predicted probabilities stored in pred_test. I define a cutoff as any value in pred_test such that the values that are greater than or equal to cutoff are assigned the value 1 and the values that are smaller than cutoff are assigned the value 0.

Is there a function in sklearn that returns the accuracy of the model for each cutoff in pred_test? I would like to see the accuracy of the model as a function of each cutoff so that I can pick a cutoff systematically.

I tried the following:

import numpy as np

_list = []
for cutoff in np.unique(pred_test):  # np.unique already returns sorted values
    binary_prediction = np.where(pred_test >= cutoff, 1, 0)
    # fraction of predictions that match the ground truth at this cutoff
    _list.append((cutoff, (binary_prediction == y_test).sum() / len(pred_test)))

Here, y_test is the ground truth (an array with the observed outcome for each of the 800,000 rows). This code returns a list where each element contains a cutoff and its corresponding accuracy score.

The object pred_test has around 600,000 distinct values, so I am iterating 600,000 or so times. The above code works, but it takes a very long time to finish. Is there a more efficient way to do this? My bet is that sklearn already has a function that does this.
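For reference, the Python-level loop can be replaced by one descending sort plus cumulative sums, so every cutoff's accuracy comes out of a single O(n log n) pass instead of ~600,000 full scans. This is only a sketch, assuming y_test and pred_test are 1-D NumPy arrays of 0/1 labels and scores; the helper name accuracy_per_cutoff is made up:

```python
import numpy as np

def accuracy_per_cutoff(y_true, scores):
    """Accuracy for every distinct cutoff, predicting 1 when score >= cutoff."""
    # Sort scores descending: the first k items are exactly the "predicted 1" set
    # when the cutoff equals the k-th largest score.
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    s_sorted = scores[order]

    tps = np.cumsum(y_sorted)            # true positives after taking first k items
    total_pos = tps[-1]
    total_neg = len(y_true) - total_pos

    k = np.arange(1, len(y_true) + 1)
    fps = k - tps                        # false positives among the first k items
    tns = total_neg - fps                # negatives correctly left as 0
    acc = (tps + tns) / len(y_true)

    # With ties, a cutoff covers the whole run of equal scores, so keep only the
    # last index of each distinct score value.
    last_of_run = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    return s_sorted[last_of_run], acc[last_of_run]
```

Sorting descending is what makes the cumulative sums work: each prefix of the sorted array is the set of rows predicted 1 at that cutoff, so TP and FP counts at all cutoffs fall out of one cumsum.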

Here is a similar thread to check: Getting the maximum accuracy for a binary probabilistic classifier in scikit-learn

There is no built-in function for that in scikit-learn. I think the reason it is not implemented is the risk of overfitting: you would essentially be tuning the cutoff to one particular data set, and a baseline chosen that way is risky to carry over to the test set.
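That said, even without a dedicated function, the per-cutoff accuracies can be computed cheaply from sklearn.metrics.roc_curve, which already returns the true/false positive rates at every distinct threshold. A sketch with made-up data (y_test and pred_test here are stand-ins for the arrays in the question):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-ins for the question's 800,000-row arrays.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 1000)
pred_test = np.clip(y_test * 0.3 + rng.random(1000) * 0.7, 0.0, 1.0)

# roc_curve uses the same convention as the question: predict 1 when
# score >= threshold. It returns one (fpr, tpr) pair per distinct threshold.
fpr, tpr, thresholds = roc_curve(y_test, pred_test)

total_pos = y_test.sum()
total_neg = len(y_test) - total_pos

# accuracy = (TP + TN) / n, where TP = tpr * P and TN = (1 - fpr) * N
acc = (tpr * total_pos + (1 - fpr) * total_neg) / len(y_test)
best = thresholds[np.argmax(acc)]
```

Note that in recent scikit-learn versions the first entry of thresholds is np.inf, corresponding to predicting 0 for every row; the formula above handles it consistently. Whether maximizing accuracy over cutoffs on held-out data is a good idea is a separate question, for the overfitting reason given above.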
