
Wrong prediction with SVC classifier in scikit-learn?

I generated my own corpus and put it into a training text file that looks like this:

POS|This film was awesome, highly recommended
NEG|I did not like this film
NEU|I went to the movies
POS|this film is very interesting, i liked a lot
NEG|the film was very boring i did not like it
NEU|the cinema is big
NEU|the cinema was dark

For testing, I have another file with an unlabeled review:

I did not like this film

Then I do the following:

import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import SVC

# Training file: one "LABEL|review" pair per line
trainingdata = pd.read_csv('/Users/user/Desktop/training.txt',
                 header=None, sep='|', names=['labels', 'movies_reviews'])

# Hash word bigrams into 7 feature columns
vect = HashingVectorizer(analyzer='word', ngram_range=(2,2), lowercase=True, n_features=7)
X = vect.fit_transform(trainingdata['movies_reviews'])
y = trainingdata['labels']

# Unlabeled test file: one review per line
TestText = pd.read_csv('/Users/user/Desktop/testing.txt',
                     header=None, names=['test_opinions'])
test = vect.transform(TestText['test_opinions'])

svm = SVC()
svm.fit(X, y)

prediction = svm.predict(test)
print(prediction)

The prediction is:

['NEU']

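What I would like to know is why this prediction is wrong. Is it a problem with the code, with the features, or with the classification algorithm? I experimented a bit, and when I removed the last review from the training text file I realized that the classifier always predicts the label of the last element in that file. Any idea how to fix this?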

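SVMs are very sensitive to parameter settings. You will need to do a grid search to find the right values. I tried training two kinds of Naive Bayes classifier on your dataset and got perfect accuracy on the training set: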

from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# first option: Gaussian NB
vect = HashingVectorizer(analyzer='word', ngram_range=(2,2), lowercase=True)
X = vect.fit_transform(trainingdata['movies_reviews'])
y = trainingdata['labels']
nb = GaussianNB().fit(X.toarray(), y)  # input needs to be dense
nb.predict(X.toarray()) == y

# second option: MultinomialNB (input must be non-negative, so use CountVectorizer instead)
vect = CountVectorizer(analyzer='word', ngram_range=(2,2), lowercase=True)
X = vect.fit_transform(trainingdata['movies_reviews'])
y = trainingdata['labels']
nb = MultinomialNB().fit(X, y)
nb.predict(X) == y

In both cases the output is

Out[33]: 
0    True
1    True
2    True
3    True
4    True
5    True
6    True
Name: labels, dtype: bool
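
The grid search suggested at the start of this answer could look roughly like the sketch below. It is only an illustration under a few assumptions that are not in the original post: the seven training reviews are inlined instead of being read from training.txt, CountVectorizer with unigrams and bigrams replaces the 7-bucket HashingVectorizer, and the candidate values for kernel and C are arbitrary starting points. With so few samples, the cross-validation scores are very noisy, so the selected parameters should not be read as meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Training reviews from the question, inlined for self-containment (assumption)
reviews = [
    "This film was awesome, highly recommended",
    "I did not like this film",
    "I went to the movies",
    "this film is very interesting, i liked a lot",
    "the film was very boring i did not like it",
    "the cinema is big",
    "the cinema was dark",
]
labels = ["POS", "NEG", "NEU", "POS", "NEG", "NEU", "NEU"]

# Vectorizer and classifier in one pipeline so each CV fold refits both
pipeline = Pipeline([
    # ngram_range=(1, 2) is an illustrative choice, not from the original post
    ("vect", CountVectorizer(analyzer="word", ngram_range=(1, 2), lowercase=True)),
    ("svm", SVC()),
])

# Hypothetical candidate values; widen or narrow the grid as needed
param_grid = {
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1, 10, 100],
}

# cv=2 because the smallest classes (POS, NEG) have only two samples each
grid = GridSearchCV(pipeline, param_grid, cv=2)
grid.fit(reviews, labels)

print(grid.best_params_)
print(grid.predict(["I did not like this film"]))
```

Wrapping the vectorizer and the SVC in a single Pipeline means the text features are rebuilt inside every fold, so the held-out reviews never influence the vocabulary used for training.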


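Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you republish this post, please credit this site's URL or the original post's address. For any questions, contact yoyou2525@163.com.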
