简体   繁体   English

Scikit使用SVC学习错误的预测

[英]Scikit learn wrong predictions with SVC

I am trying to predict the MNIST ( http://pjreddie.com/projects/mnist-in-csv/ ) dataset with an SVM using the radial kernel. 我正在尝试使用径向核的SVM预测MNIST( http://pjreddie.com/projects/mnist-in-csv/ )数据集。 I want to train with few examples (eg 1000) and predict many more. 我想训练几个例子(例如1000个)并预测更多例子。 The problem is that whenever I predict, the predictions are constant unless the indices of the test set coincide with those of the training set. 问题在于,只要我进行预测, 除非测试集的索引与训练集的索引一致, 否则预测是恒定的。 That is, suppose I train with examples 1:1000 from my training examples. 也就是说,假设我从训练示例中以示例1:1000进行了训练。 Then, the predictions are correct (ie the SVM does its best) for 1:1000 of my test set, but then I get the same output for the rest. 然后,对于我的测试集的1:1000的预测是正确的(即SVM尽其所能),但是其余部分我得到相同的输出。 If however I train with examples 2001:3000, then only the test examples corresponding to those rows in the test set are labeled correctly (ie not with the same constant). 但是,如果我使用示例2001:3000进行训练,则仅正确地标记了与测试集中的那些行相对应的测试示例(即,不具有相同的常数)。 I am completely at a loss, and I think that there is some sort of bug, because the exact same code works just fine with LinearSVC, although evidently the accuracy of the method is lower. 我完全不知所措,我认为这存在某种错误,因为与完全相同的代码在LinearSVC上可以正常工作,尽管显然该方法的准确性较低。

First, I train with examples 501:1000 of training data: 首先,我以示例501:1000的训练数据进行训练:

# dat_train/test are pandas DFs corresponding to both MNIST datasets
dat_train = pd.read_csv('data/mnist_train.csv', header=None)
dat_test = pd.read_csv('data/mnist_train.csv', header=None)

svm = SVC(C=10.0)
idx = range(1000)
#idx = np.random.choice(range(len(dat_train)), size=1000, replace=False)
X_train = dat_train.iloc[idx,1:].reset_index(drop=True).as_matrix()
y_train = dat_train.iloc[idx,0].reset_index(drop=True).as_matrix()
X_test = dat_test.reset_index(drop=True).as_matrix()[:,1:]
y_test = dat_test.reset_index(drop=True).as_matrix()[:,0]
svm.fit(X=X_train[501:1000,:], y=y_train[501:1000])

Here you can see that about half the predictions are wrong 在这里您可以看到大约一半的预测是错误的

y_pred = svm.predict(X_test[:1000,:])
confusion_matrix(y_test[:1000], y_pred)

All wrong (ie constant) 都是错误的(即常量)

y_pred = svm.predict(X_test[:500,:])
confusion_matrix(y_test[:500], y_pred)

This is what I would expect to see for all test data 这是我希望所有测试数据都能看到的

y_pred = svm.predict(X_test[501:1000,:])
confusion_matrix(y_test[501:1000], y_pred)

You can check that all of the above are correct using LinearSVC! 您可以使用LinearSVC检查以上所有内容是否正确!

The default kernel is RBF, in which case gamma matters. 默认内核是RBF,在这种情况下, gamma重要。 If gamma is not provided, it is auto by default, which is 1/n_features. 如果未提供gamma ,则默认情况下它是auto的,即1 / n_features。 You'd better run grid search to find the optimal parameters. 您最好运行网格搜索以找到最佳参数。 Here I just illustrate the result is normal given suitable parameters. 在这里,我仅说明在给定适当参数的情况下结果是正常的。

In [120]: svm = SVC(C=1, gamma=0.0000001)

In [121]: svm.fit(X=X_train[501:1000,:], y=y_train[501:1000])
Out[121]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1e-07, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [122]: y_pred = svm.predict(X_test[:1000,:])

In [123]: confusion_matrix(y_test[:1000], y_pred)
Out[123]:
array([[ 71,   0,   2,   0,   2,   9,   1,   0,   0,   0],
       [  0, 123,   0,   0,   0,   1,   1,   0,   1,   0],
       [  2,   5,  91,   1,   1,   1,   3,   7,   5,   0],
       [  0,   1,   4,  48,   0,  40,   1,   5,   7,   1],
       [  0,   0,   0,   0,  88,   2,   3,   2,   0,  15],
       [  1,   1,   1,   0,   2,  77,   0,   3,   1,   1],
       [  3,   0,   3,   0,   5,   4,  72,   0,   0,   0],
       [  0,   2,   3,   0,   3,   0,   1,  88,   1,   1],
       [  2,   0,   1,   2,   3,   9,   1,   4,  63,   4],
       [  0,   1,   0,   0,  16,   3,   0,  11,   1,  62]])

Finding good parameters for an SVC is an art in itself. 为SVC寻找好的参数本身就是一门艺术。 Grid Search might help, better works some population based training like in this article - i recently tried it. 网格搜索可能会有所帮助,更好地像本文中所述的那样进行一些基于人口的培训 -我最近尝试过。 If you let it run the same time, it has better results than GridSearch. 如果让它同时运行,它将比GridSearch具有更好的结果。 If you let it run until the accuracy is the same, it is faster. 如果让它运行直到精度相同,它就会更快。

It also helps to make a graphic: let the x and y axis be C and gamma, and plot the prediction scores as color. 它还有助于制作图形:让x和y轴为C和gamma,并将预测分数绘制为颜色。 Usually you will find kind of a V-Shape with the best training results at the point where the two lines meet. 通常,您会在两条直线相交的点找到训练效果最好的V形。 At the same time this point has low C-Values, too, which is desirable because C determines the runtime of the SVC: High C makes a long runtime. 同时,这一点也具有较低的C值,这是可取的,因为C决定了SVC的运行时间:高C使运行时间长。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM