
Wrong prediction with SVC classifier in scikit-learn?

I generated my own corpus and split it into a training text file like this:

POS|This film was awesome, highly recommended
NEG|I did not like this film
NEU|I went to the movies
POS|this film is very interesting, i liked a lot
NEG|the film was very boring i did not like it
NEU|the cinema is big
NEU|the cinema was dark

And for testing I have another text review, which is unlabeled:

I did not like this film

Then I do the following:

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import SVC

# Load the labeled training reviews: one "LABEL|text" pair per line
trainingdata = pd.read_csv('/Users/user/Desktop/training.txt',
                           header=None, sep='|', names=['labels', 'movies_reviews'])

# Hash word bigrams into only 7 feature slots
vect = HashingVectorizer(analyzer='word', ngram_range=(2, 2), lowercase=True, n_features=7)
X = vect.fit_transform(trainingdata['movies_reviews'])
y = trainingdata['labels']

# Load the unlabeled test review and transform it with the same vectorizer
TestText = pd.read_csv('/Users/user/Desktop/testing.txt',
                       header=None, names=['test_opinions'])
test = vect.transform(TestText['test_opinions'])

svm = SVC()
svm.fit(X, y)

prediction = svm.predict(test)
print(prediction)

And the prediction is:

['NEU']

Then what comes to my mind is: why is this prediction wrong? Is this a code problem, a feature problem, or a classification-algorithm problem? I experimented with it, and when I remove the last review from the training text file I notice that the classifier always predicts the label of the last line in that file. Any idea how to fix this problem?

SVMs are notoriously sensitive to parameter settings. You will need to do a grid search to find the right values (see the sketch at the end of this answer). I tried training two kinds of Naive Bayes on your dataset and got perfect accuracy on the training set:

from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer

# First option: Gaussian NB on hashed bigram features
vect = HashingVectorizer(analyzer='word', ngram_range=(2, 2), lowercase=True)
X = vect.fit_transform(trainingdata['movies_reviews'])
y = trainingdata['labels']
nb = GaussianNB().fit(X.A, y)  # GaussianNB needs a dense input, hence X.A
nb.predict(X.A) == y

# Second option: MultinomialNB (input needs to be non-negative, so use CountVectorizer instead)
vect = CountVectorizer(analyzer='word', ngram_range=(2, 2), lowercase=True)
X = vect.fit_transform(trainingdata['movies_reviews'])
y = trainingdata['labels']
nb = MultinomialNB().fit(X, y)
nb.predict(X) == y

In both cases the output is

Out[33]: 
0    True
1    True
2    True
3    True
4    True
5    True
6    True
Name: labels, dtype: bool
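
For the SVC itself, here is a minimal grid-search sketch, assuming the X and y built from the training file above. The parameter ranges below are illustrative assumptions, not values tuned for this corpus, and the sklearn.model_selection import reflects current scikit-learn:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the right ranges depend on your data
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.01, 0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
}

# cv=2 because this toy corpus has only seven samples (two in the smallest class)
grid = GridSearchCV(SVC(), param_grid, cv=2)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)

The tuned model is then available as grid.best_estimator_ and can be used to predict the test review the same way as above.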
