
NLTK Classifier giving only negative as answer in Sentiment Analysis

I am doing sentiment analysis with NLTK, training on the built-in movie_reviews corpus, and every time I get neg as the result.

My code:

import nltk
import random
import pickle
from nltk.corpus import movie_reviews
from os.path import exists
from nltk.classify import apply_features
from nltk.tokenize import word_tokenize, sent_tokenize

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())
print(word_features)

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = apply_features(find_features, documents[:numtrain])
testing_set = apply_features(find_features, documents[numtrain:])

classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier.show_most_informative_features(15)

Example_Text = " avoids annual conveys vocal thematic doubts fascination slip avoids outstanding thematic astounding seamless"

doc = word_tokenize(Example_Text.lower())
featurized_doc = {i:(i in doc) for i in word_features} 
tagged_label = classifier.classify(featurized_doc)
print(tagged_label)

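Here I use a NaiveBayesClassifier, train it on the movie_reviews corpus, and then use the trained classifier to test the sentiment of my Example_Text.

As you can see, my Example_Text consists of some random words. When I call classifier.show_most_informative_features(15), it gives me a list of the 15 words with the most lopsided pos/neg ratios. I picked the positive words shown in this list.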

Most Informative Features
                  avoids = True              pos : neg    =     12.1 : 1.0
               insulting = True              neg : pos    =     10.8 : 1.0
               atrocious = True              neg : pos    =     10.6 : 1.0
             outstanding = True              pos : neg    =     10.2 : 1.0
                seamless = True              pos : neg    =     10.1 : 1.0
                thematic = True              pos : neg    =     10.1 : 1.0
              astounding = True              pos : neg    =     10.1 : 1.0
                    3000 = True              neg : pos    =      9.9 : 1.0
                  hudson = True              neg : pos    =      9.9 : 1.0
               ludicrous = True              neg : pos    =      9.8 : 1.0
                   dread = True              pos : neg    =      9.5 : 1.0
                   vocal = True              pos : neg    =      9.5 : 1.0
                 conveys = True              pos : neg    =      9.5 : 1.0
                  annual = True              pos : neg    =      9.5 : 1.0
                    slip = True              pos : neg    =      9.5 : 1.0

So why am I not getting pos as the result? Why do I always get neg, even though the classifier has been trained correctly?

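The problem is that you are including every word as a feature, and the features of the form "word: False" create a lot of extra noise that drowns out these positive features. I looked at the two log probabilities, and they are very close: -812 versus -808. For this kind of problem it is generally appropriate to use only "word: True" style features, because all the others just add noise.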
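
The swamping effect can be reproduced without NLTK. The following is only a toy sketch: the word names, the probabilities, and the `log_score` and `classify` helpers are all invented for illustration and are not measured from movie_reviews. It assumes equal class priors and sums log probabilities over features the way a naive Bayes model does, once with the "word: False" terms and once without.

```python
import math

# Hypothetical P(word present | label); these numbers are made up for illustration.
STRONG = [f"strong{i}" for i in range(5)]     # rare words, 10x more likely in pos
FILLER = [f"filler{i}" for i in range(2000)]  # common words, slightly more likely in pos

P_PRESENT = {
    "pos": {**{w: 0.10 for w in STRONG}, **{w: 0.30 for w in FILLER}},
    "neg": {**{w: 0.01 for w in STRONG}, **{w: 0.29 for w in FILLER}},
}

def log_score(label, doc_words, use_false_features):
    """Sum log2 P(feature value | label) over the vocabulary (equal priors assumed)."""
    score = 0.0
    for word, p in P_PRESENT[label].items():
        if word in doc_words:
            score += math.log2(p)          # "word: True" term
        elif use_false_features:
            score += math.log2(1.0 - p)    # "word: False" term
    return score

def classify(doc_words, use_false_features):
    """Return the label with the highest summed log score."""
    return max(P_PRESENT, key=lambda lbl: log_score(lbl, doc_words, use_false_features))

doc = set(STRONG)  # a short text containing only the strongly positive words
print(classify(doc, use_false_features=True))   # neg
print(classify(doc, use_false_features=False))  # pos
```

In this toy setup each absent filler word shifts the score toward neg by only about 0.02 bits, but 2000 of them outweigh the roughly 17 bits contributed by the five strong words.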

I copied your code but modified the last three lines as follows:

featurized_doc = {c:True for c in Example_Text.split()}
tagged_label = classifier.classify(featurized_doc)
print(tagged_label)

and got the output "pos".
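
The same idea could be carried into the feature extractor, so that it emits only the words that actually occur. This is a minimal sketch rather than the answerer's code: the helper name `find_present_features` and the tiny `vocabulary` set are hypothetical stand-ins.

```python
def find_present_features(document, vocabulary):
    """Map each known word that occurs in the document to True; omit absent words."""
    return {word: True for word in set(document) if word in vocabulary}

# Hypothetical stand-in for word_features from the question's script.
vocabulary = {"outstanding", "seamless", "atrocious"}
tokens = "an outstanding and seamless film".split()

features = find_present_features(tokens, vocabulary)
print(sorted(features))  # ['outstanding', 'seamless']
```

In the question's script, word_features would play the role of `vocabulary`; converting it to a set keeps the membership test fast. The original answer does not report how training on such sparse dictionaries affects accuracy on held-out reviews.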
