
Predict probabilities using SVM

I wrote this code and wanted to obtain probabilities of classification.

from sklearn import svm
X = [[0, 0], [10, 10],[20,30],[30,30],[40, 30], [80,60], [80,50]]
y = [0, 1, 2, 3, 4, 5, 6]
clf = svm.SVC() 
clf.probability = True
clf.fit(X, y)
prob = clf.predict_proba([[10, 10]])
print(prob)

I obtained this output:

[[0.15376986 0.07691205 0.15388546 0.15389275 0.15386348 0.15383004 0.15384636]]

which is very weird, because the probability should have been

[0 1 0 0 0 0 0]

(Observe that the sample whose class has to be predicted is identical to the 2nd training sample.) Also, the probability obtained for that class is the lowest.

You should disable probability and use decision_function instead, because there is no guarantee that predict_proba and predict return the same result. You can read more about it here in the documentation.

import numpy as np

clf.predict([[10, 10]])  # returns array([1]) as expected

prop = clf.decision_function([[10, 10]])
# returns [[ 4.91666667  6.5         3.91666667  2.91666667
#            1.91666667  0.91666667 -0.08333333]]
prediction = np.argmax(prop)  # returns 1

EDIT: As pointed out by @TimH, the probabilities can be given by clf.decision_function(X). The code below is fixed. Regarding the noted issue of low probabilities when using predict_proba(X), I think the answer is that, according to the official doc here, "... Also, it will produce meaningless results on very small datasets."

The answer resides in understanding what the resulting probabilities of SVMs are. In short, you have 7 classes and 7 points in the 2D plane. What SVMs try to do is find a linear separator between each class and each of the others (the one-vs-one approach). Each time, only 2 classes are compared. What you get are the votes of the pairwise classifiers, after normalization. See a more detailed explanation of the multi-class SVMs of libsvm in this post or here (scikit-learn uses libsvm).
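To make the voting concrete, here is a minimal sketch of my own (not part of the original answer): it asks SVC for the raw pairwise scores via decision_function_shape='ovo' and tallies the votes by hand, assuming libsvm's documented pair ordering (0 vs 1, 0 vs 2, ..., 1 vs 2, ...) and sign convention, where a positive score counts as a vote for the first class of the pair.

from sklearn import svm
import numpy as np

X = [[0, 0], [10, 10], [20, 30], [30, 30], [40, 30], [80, 60], [80, 50]]
y = [0, 1, 2, 3, 4, 5, 6]

# 'ovo' exposes the raw pairwise scores: one column per pair of classes,
# so 7 * 6 / 2 = 21 columns in total
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X, y)

d = clf.decision_function([[10, 10]])[0]  # shape (21,)

n_classes = len(clf.classes_)
votes = np.zeros(n_classes)
k = 0
for i in range(n_classes):
    for j in range(i + 1, n_classes):
        if d[k] > 0:        # positive score: a vote for class i
            votes[i] += 1
        else:               # negative score: a vote for class j
            votes[j] += 1
        k += 1

print(votes)             # class 1 should collect the most votes
print(np.argmax(votes))  # 1, matching clf.predict([[10, 10]])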

By slightly modifying your code, we see that the right class is indeed chosen:

from sklearn import svm
import matplotlib.pyplot as plt
import numpy as np


X = [[0, 0], [10, 10],[20,30],[30,30],[40, 30], [80,60], [80,50]]
y = [0, 1, 2, 3, 4, 5, 6]
clf = svm.SVC() 
clf.fit(X, y)

x_pred = X  # score all 7 training points
p = np.array(clf.decision_function(x_pred)) # decision is a voting function
prob = np.exp(p)/np.sum(np.exp(p),axis=1, keepdims=True) # softmax after the voting
classes = clf.predict(x_pred)

for idx, (v, s, c) in enumerate(zip(p, prob, classes)):
    print('Sample={}, Prediction={},\n Votes={} \nP={}, '.format(idx, c, v, s))

The corresponding output is

Sample=0, Prediction=0,
Votes=[ 6.5         4.91666667  3.91666667  2.91666667  1.91666667  0.91666667 -0.08333333] 
P=[ 0.75531071  0.15505748  0.05704246  0.02098475  0.00771986  0.00283998  0.00104477], 
Sample=1, Prediction=1,
Votes=[ 4.91666667  6.5         3.91666667  2.91666667  1.91666667  0.91666667 -0.08333333] 
P=[ 0.15505748  0.75531071  0.05704246  0.02098475  0.00771986  0.00283998  0.00104477], 
Sample=2, Prediction=2,
Votes=[ 1.91666667  2.91666667  6.5         4.91666667  3.91666667  0.91666667 -0.08333333] 
P=[ 0.00771986  0.02098475  0.75531071  0.15505748  0.05704246  0.00283998  0.00104477], 
Sample=3, Prediction=3,
Votes=[ 1.91666667  2.91666667  4.91666667  6.5         3.91666667  0.91666667 -0.08333333] 
P=[ 0.00771986  0.02098475  0.15505748  0.75531071  0.05704246  0.00283998  0.00104477], 
Sample=4, Prediction=4,
Votes=[ 1.91666667  2.91666667  3.91666667  4.91666667  6.5         0.91666667 -0.08333333] 
P=[ 0.00771986  0.02098475  0.05704246  0.15505748  0.75531071  0.00283998  0.00104477], 
Sample=5, Prediction=5,
Votes=[ 3.91666667  2.91666667  1.91666667  0.91666667 -0.08333333  6.5  4.91666667] 
P=[ 0.05704246  0.02098475  0.00771986  0.00283998  0.00104477  0.75531071  0.15505748], 
Sample=6, Prediction=6,
Votes=[ 3.91666667  2.91666667  1.91666667  0.91666667 -0.08333333  4.91666667  6.5       ] 
P=[ 0.05704246  0.02098475  0.00771986  0.00283998  0.00104477  0.15505748  0.75531071], 

And you can also see the decision zones:

X = np.array(X)
y = np.array(y)
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111)

# Evaluate the classifier on a dense grid to draw the decision regions
XX, YY = np.mgrid[0:100:200j, 0:100:200j]
Z = clf.predict(np.c_[XX.ravel(), YY.ravel()])
Z = Z.reshape(XX.shape)

ax.pcolormesh(XX, YY, Z, cmap=plt.cm.Paired)

# Overlay the 7 training points
for idx in range(7):
    ax.scatter(X[idx, 0], X[idx, 1], color='k')

plt.show()

[Plot: the decision regions of the fitted classifier, with the 7 training points in black]

You can read in the docs that...

The SVC method decision_function gives per-class scores for each sample (or a single score per sample in the binary case). When the constructor option probability is set to True, class membership probability estimates (from the methods predict_proba and predict_log_proba) are enabled. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM's scores, fit by an additional cross-validation on the training data. In the multiclass case, this is extended as per Wu et al. (2004).

Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores, in the sense that the "argmax" of the scores may not be the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging to a class that has probability <½ according to predict_proba.) Platt's method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
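The question above is that inconsistency in action. Here is a minimal sketch reproducing it (the exact probabilities depend on the scikit-learn version, and with a single training point per class the internal cross-validation folds carry essentially no information):

from sklearn import svm
import numpy as np

X = [[0, 0], [10, 10], [20, 30], [30, 30], [40, 30], [80, 60], [80, 50]]
y = [0, 1, 2, 3, 4, 5, 6]

clf = svm.SVC(probability=True)  # triggers the extra Platt-scaling fit
clf.fit(X, y)

sample = [[10, 10]]
print(clf.predict(sample))                   # [1], decided by the pairwise votes
print(np.argmax(clf.predict_proba(sample)))  # may well not be 1: the calibration
                                             # is cross-validated, which is
                                             # meaningless with one point per class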

There is also a lot of confusion about this function among Stack Overflow users, as you can see in this thread, or this one.
