
Probabilistic classification with Gaussian Bayes Classifier vs Logistic Regression

I have a binary classification problem, and I have a few great features that can predict almost 100% of the test data correctly, because the problem is relatively simple.

However, the nature of the problem means I cannot afford to make mistakes. So instead of giving a prediction I am not sure of, I would rather have the output as a probability, set a threshold, and be able to say: "if I am less than 95% sure, I will call this 'NOT SURE' and act accordingly". Saying "I don't know" is better than making a mistake.
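To make the rule concrete, here is a minimal sketch of the abstain logic I have in mind (the function name and threshold are illustrative):

```python
# Map a positive-class probability to a decision, abstaining when unsure.
# A label is returned only when the model is at least 95% sure either way.

def classify_with_abstain(p, threshold=0.95):
    """Return a class label, or 'NOT SURE' if p is in the uncertain middle band."""
    if p >= threshold:
        return "POSITIVE"
    if p <= 1 - threshold:
        return "NEGATIVE"
    return "NOT SURE"

print(classify_with_abstain(0.99))  # POSITIVE
print(classify_with_abstain(0.50))  # NOT SURE
print(classify_with_abstain(0.02))  # NEGATIVE
```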

So far so good.

For this purpose, I tried the Gaussian Bayes Classifier (I have continuous features) and Logistic Regression, both of which give me a probability as well as a predicted class.

Coming to my problem:

  • GBC has around a 99% success rate, while Logistic Regression is lower, at around 96%. So I would naturally prefer to use GBC. However, as successful as GBC is, it is also very sure of itself. The probabilities I get are either 1 or very, very close to 1, such as 0.9999997, which makes things tough for me, because in practice GBC does not really provide me probabilities.

  • Logistic Regression performs worse, but at least gives better, more 'sensible' probabilities.

By the nature of my problem, the cost of misclassification grows as a power of 2: if I misclassify 4 of the products, I lose 2^4 as much (it's unit-less, but it gives the idea).

In the end, I would like to classify with a higher success rate than Logistic Regression, but also get meaningful probabilities, so I can set a threshold and flag the cases I am not sure of.

Any suggestions?

Thanks in advance.

If you have enough data, you can simply retune the probabilities. For example, given the "predicted probability" output of your Gaussian classifier, you can go back through (on a held-out dataset) and, at different prediction values, estimate the probability of the positive class.
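A minimal sketch of that retuning idea, using simple histogram binning (all function and variable names are mine, not from any library):

```python
# Recalibrate overconfident scores by histogram binning on a held-out set.

def fit_binned_calibrator(scores, labels, n_bins=10):
    """Bucket held-out scores, then map each bucket to its empirical positive rate."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # clamp s == 1.0 into the last bin
        bins[idx].append(y)
    rates = [
        sum(b) / len(b) if b else (i + 0.5) / n_bins  # empty bin: use the midpoint
        for i, b in enumerate(bins)
    ]
    def calibrate(s):
        return rates[min(int(s * n_bins), n_bins - 1)]
    return calibrate

# toy held-out data: the model reports ~0.99 but is right only 9 times out of 10
held_out_scores = [0.99] * 10
held_out_labels = [1] * 9 + [0]
cal = fit_binned_calibrator(held_out_scores, held_out_labels)
print(cal(0.9999997))  # 0.9 — the overconfident score is pulled back to reality
```

With more data you would use more bins (or isotonic regression), but the principle is the same: replace the classifier's raw score with the observed frequency of the positive class at that score level.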

Further, you can simply set up an optimization on your holdout set to determine the best threshold (without actually estimating the probability). Since it's one-dimensional, you shouldn't even need anything fancy for the optimization: test, say, 500 different thresholds and pick the one that minimizes the cost associated with misclassifications.
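A sketch of that brute-force search (names and toy data are illustrative; the cost of a run is 2^k for k mistakes, as in the question, while the small per-abstention cost is my own assumption, reflecting that "NOT SURE" is cheaper than being wrong):

```python
# Brute-force threshold search on a holdout set, minimizing total cost.

def total_cost(scores, labels, threshold, abstain_cost=0.5):
    """Cost of a run: 2**k for k mistakes, plus a small charge per abstention."""
    mistakes = abstentions = 0
    for s, y in zip(scores, labels):
        if s >= threshold:
            pred = 1
        elif s <= 1 - threshold:
            pred = 0
        else:
            abstentions += 1  # borderline score: say "NOT SURE"
            continue
        if pred != y:
            mistakes += 1
    return 2 ** mistakes + abstain_cost * abstentions

def best_threshold(scores, labels, n=500, abstain_cost=0.5):
    """Scan ~500 candidate thresholds in [0.5, 1.0] and keep the cheapest."""
    candidates = [0.5 + 0.5 * i / n for i in range(n + 1)]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, abstain_cost))

# toy holdout: one borderline example (0.6) that is actually negative
scores = [0.99, 0.9, 0.6, 0.1]
labels = [1, 1, 0, 0]
t = best_threshold(scores, labels)
print(t)  # a threshold just above 0.6: keep the confident calls, abstain on 0.6
```

Because the misclassification cost is exponential, the search will naturally prefer thresholds that abstain on borderline cases rather than risk a mistake.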
