
Scikit-learn predict_proba gives wrong answers

This is a follow-up question from How to know what classes are represented in return array from predict_proba in Scikit-learn.

In that question, I quoted the following code:

>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([[1,2,3]])
array([[ 0.39097541,  0.60902459]])

I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:

>>> zip(model.classes_, model.predict_proba([[1,2,3]])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]

So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:

>>> zip(model.classes_, model.predict_proba([[2,3,4]])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]

Again, obviously incorrect, but in the other direction.

Finally, I tried it with points that were much further away.

>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([[1,1,1]])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]

Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!

>>> model.predict([[1,1,1]])[0]
'apple'

Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?
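For reference, the disagreement can be made explicit by comparing the argmax of predict_proba with the output of predict (a minimal sketch using the model fitted on [[1,1,1], [20,20,20]] above):

>>> import numpy as np
>>> probas = model.predict_proba([[1,1,1]])[0]
>>> model.classes_[np.argmax(probas)]  # class with the highest calibrated probability
'orange'
>>> model.predict([[1,1,1]])[0]        # class from the SVM decision function
'apple'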

-------- UPDATE --------

Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!

>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
... 
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
...     train_test(n)
... 
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]

How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?
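For what it's worth, agreement can be checked empirically with a helper along these lines (a sketch that simply compares the argmax of predict_proba against predict for each n; on this toy data only small n should disagree, but that is an observation, not a guarantee):

>>> import numpy as np
>>> def agrees(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     proba_label = model.classes_[np.argmax(model.predict_proba([[1,2,3]])[0])]
...     return proba_label == model.predict([[1,2,3]])[0]
... 
>>> [(n, agrees(n)) for n in range(1, 10)]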

predict_proba is using the Platt scaling feature of libsvm to calibrate probabilities; see the notes on probability estimates in the libsvm documentation.

So indeed the hyperplane predictions and the probability calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross-validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
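For intuition, Platt scaling fits a sigmoid p(y=1 | f) = 1 / (1 + exp(A*f + B)) to the raw decision values f. The following is a simplified sketch of the idea on made-up data, not libsvm's exact procedure (which fits the sigmoid on cross-validated decision values):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# illustrative data, just to exercise the mechanics
X = np.array([[1, 2, 3], [2, 3, 4]] * 10)
y = np.array([0, 1] * 10)

svc = SVC(kernel='linear').fit(X, y)
f = svc.decision_function(X).reshape(-1, 1)  # signed distances to the hyperplane

# a plain logistic regression on the decision values stands in for Platt's sigmoid fit
sigmoid = LogisticRegression().fit(f, y)
print(sigmoid.predict_proba(svc.decision_function([[1, 2, 3]]).reshape(-1, 1)))

Newer scikit-learn versions expose a similar calibration directly as sklearn.calibration.CalibratedClassifierCV with method='sigmoid'.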

If you use svm.LinearSVC() as the estimator, and .decision_function() (which is like svm.SVC's .predict_proba()) for sorting the results from the most probable class to the least probable one, then it agrees with the .predict() function. Plus, this estimator is faster and gives almost the same results as svm.SVC().

The only drawback for you might be that .decision_function() gives a signed value, something like between -1 and 3, instead of a probability value. But it agrees with the prediction.
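A sketch of that approach on made-up three-class data (all names here are illustrative):

import numpy as np
from sklearn.svm import LinearSVC

X = [[1, 2, 3], [2, 3, 4], [9, 9, 9]] * 5
y = ['apple', 'orange', 'pear'] * 5

clf = LinearSVC().fit(X, y)
scores = clf.decision_function([[1, 2, 3]])[0]  # one signed score per class, not probabilities

# rank classes from most to least likely; the top-ranked class matches clf.predict()
print(clf.classes_[np.argsort(scores)[::-1]])
print(clf.predict([[1, 2, 3]])[0])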

Food for thought here. I think I actually got predict_proba to work as-is. Please see the code below...

import numpy as np
import pandas as pd
from sklearn import metrics, naive_bayes

# Test data
TX = [[1,2,3], [4,5,6], [7,8,9], [10,11,12], [13,14,15], [16,17,18], [19,20,21], [22,23,24]]
TY = ['apple', 'orange', 'grape', 'kiwi', 'mango','peach','banana','pear']

VX2 = [[16,17,18], [19,20,21], [22,23,24], [13,14,15], [10,11,12], [7,8,9], [4,5,6], [1,2,3]]
VY2 = ['peach','banana','pear','mango', 'kiwi', 'grape', 'orange','apple']

VX2_df = pd.DataFrame(data=VX2) # convert to dataframe
VX2_df = VX2_df.rename(index=float, columns={0: "N0", 1: "N1", 2: "N2"})
VY2_df = pd.DataFrame(data=VY2) # convert to dataframe
VY2_df = VY2_df.rename(index=float, columns={0: "label"})

# NEW - in testing
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the top n labels on validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)

    #Identify the indexes of the top predictions
    #top_n_predictions = np.argsort(probas)[:,:-n-1:-1]
    top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]

    #then find the associated SOC code for each prediction
    top_socs = classifier.classes_[top_n_predictions]

    #cast to a new dataframe
    top_n_df = pd.DataFrame(data=top_socs)

    #merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_n_df, left_index=True, right_index=True)

    conditions = [
        (results['label'] == results[0]),
        (results['label'] == results[1]),
        (results['label'] == results[2]),
        (results['label'] == results[3]),
        (results['label'] == results[4])]
    choices = [1, 1, 1, 1, 1]
    results['Successes'] = np.select(conditions, choices, default=0)

    print("Top 5 Accuracy Rate = ", sum(results['Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate = ", metrics.accuracy_score(predictions, valid_y))

train_model(naive_bayes.MultinomialNB(), TX, TY, VX2, VY2_df, VX2_df)

Output: Top 5 Accuracy Rate = 1.0, Top 1 Accuracy Rate = 1.0

Couldn't get it to work for my own data though :(

There is some confusion as to what predict_proba actually does. It does not predict probabilities as the title suggests, but outputs distances. In the apple vs orange example (0.39097541, 0.60902459), the shortest distance, 0.39097541, is the apple class, which is counter-intuitive. You are looking at the highest probability, but that's not the case.

Another source of confusion stems from the fact that predict_proba does match hard labels, just not in the order of classes from 0..n sequentially. Scikit seems to shuffle the classes, but it is possible to map them, as sketched below.
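A minimal sketch of that mapping (the classifier and data are illustrative): column j of predict_proba corresponds to classifier.classes_[j], so an argmax over columns can be turned back into a label.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] * 3  # illustrative data
y = [0, 1, 2] * 3

clf = MultinomialNB().fit(X, y)
probas = clf.predict_proba([[1, 2, 3]])

# map each row's highest-probability column back to a class label
print(clf.classes_[np.argmax(probas, axis=1)])
print(clf.predict([[1, 2, 3]]))  # same label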

Here is how it works.

   say we have 5 classes with labels:
   classifier.classes_ = [0 1 2 3 4]
   target names = ['1', '2', '3', '6', '8']

Predicted labels: [2 0 1 0 4]

    classifier.predict_proba
    [[ 0.20734121  0.20451986  0.17262553  0.20768649  0.20782692]
     [ 0.19099348  0.2018391   0.20222314  0.20136784  0.20357644]
     [ 0.19982284  0.19497121  0.20399376  0.19824784  0.20296435]
     [ 0.19884577  0.1999416   0.19998889  0.20092702  0.20029672]
     [ 0.20328893  0.2025956   0.20500402  0.20383255  0.1852789 ]]

    Confusion matrix:
    [[1 0 0 0 0]
     [0 1 0 0 0]
     [0 0 1 0 0]
     [1 0 0 0 0]
     [0 0 0 0 1]]

    y_test [2 0 1 3 4]
    pred [2 0 1 0 4]
    classifier.classes_ = [0 1 2 3 4]

Everything but the third class is a match. According to the predicted labels in the cm (confusion matrix), class 0 is predicted and the actual class is 0, argmax(pred_prob). But it's mapped to

     y_test [2 0 1 3 4]

so find class number 2:

    0              1             2          3          4
    [ 0.20734121  0.20451986  0.17262553  0.20768649  0.20782692]
    and the winner is **0.17262553**

let's do it again. 让我们再来一次。 look at the misclassification result numero 4 where actual lebel 4, predicted 1 according to cm. 看看错误分类结果,其中实际的lebel 4,根据cm预测为1。

    BUT y_test [2 0 1 3 4] pred [2 0 1 0 4]
    which translates to actual label 3 predicted label 0
    0             1             2            3        4
[0.19884577  0.1999416   0.19998889  0.20092702  0.20029672]
    look at label number 0, and the winner is **0.19884577**
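Expressed programmatically, the first lookup above (the probability a sample's true class received) amounts to indexing the probability matrix by each label's position in classifier.classes_; a sketch using the numbers shown above:

import numpy as np

probas = np.array([
    [0.20734121, 0.20451986, 0.17262553, 0.20768649, 0.20782692],
    [0.19099348, 0.2018391,  0.20222314, 0.20136784, 0.20357644],
    [0.19982284, 0.19497121, 0.20399376, 0.19824784, 0.20296435],
    [0.19884577, 0.1999416,  0.19998889, 0.20092702, 0.20029672],
    [0.20328893, 0.2025956,  0.20500402, 0.20383255, 0.1852789 ],
])
classes = np.array([0, 1, 2, 3, 4])
y_test = np.array([2, 0, 1, 3, 4])

# probability assigned to each sample's true class; the first entry
# reproduces the 0.17262553 lookup above
cols = np.searchsorted(classes, y_test)
print(probas[np.arange(len(y_test)), cols])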

That's my $0.02.
