
Sentiment Analysis, Naive Bayes Accuracy

I'm trying to write a Naive Bayes classifier script for sentiment classification of tweets. I'm pasting my whole code here, because I know I will get hell if I don't. Basically, I use NLTK's corpora as training data and some tweets I scraped as test data. I pre-process them and do a bag-of-words extraction. The classifier trains with no problem, and when I do the following

print(classifier.classify(bag_of_words('This is magnificent')))  

it correctly outputs 'pos'.

Now my problem is how to calculate accuracy using nltk.classify.util.accuracy. I do

print(nltk.classify.accuracy(classifier, proc_set))

and I get the following error:

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nltk/classify/util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
AttributeError: 'NaiveBayesClassifier' object has no attribute 'classify_many'

I also tried this

test_set_final = []
for tweet in proc_test:
    test_set_final.append((bag_of_words(tweet), classifier.classify(bag_of_words(tweet))))

print(nltk.classify.accuracy(classifier, test_set_final))

and I get the same kind of error

print(nltk.classify.accuracy(classifier, test_set_final))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nltk/classify/util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs, l) in gold])
AttributeError: 'NaiveBayesClassifier' object has no attribute 'classify_many'

I am 100% sure I am missing something extremely obvious to machine learning people. But it's been 3 days and I'm slowly losing my mind, so any help will be appreciated.

Code:

import nltk
import ast
import string
import re
import csv
import textblob
import pandas as pd
import numpy as np
import itertools
from textblob import TextBlob
from textblob import Word
from textblob.classifiers import NaiveBayesClassifier
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wd
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from random import shuffle
from nltk.classify.util import accuracy
from autocorrect import spell

stopwords = stopwords.words('english')
lemmatizer = nltk.WordNetLemmatizer().lemmatize
punct = ['"', '$', '%', '&', "'", '(', ')', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '@', '[', '\\', '^', '_', '`', '{', '|', '}', '~']

emoticons_happy = set([
':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
':^)', ':-D', ':D', ': D','8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
'=-3', '=3', ':-))', ':-)', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)',
'<3',':*', ':p'
])

emoticons_sad = set([
':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
':-[', ':-<', '=\\', '=/', '>:(', ':-(', '>.<', ":'-(", ":'(", ':\\', ':-c',
':c', ':{', '>:\\', ';('
])
emoticons = emoticons_happy.union(emoticons_sad)


def pre_process(tweet):

    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)

    tweet = re.sub(r'#', '', tweet)

    tweet=''.join([i for i in tweet if not i.isdigit()])

    tweet=re.sub(r'([.,/#!$%^&*;:{}=_`~-])([.,/#!$%^&*;:{}=_`~-]+)\1+', r'\1',tweet)

    tweet = re.sub(r'@[A-Za-z0-9]+', '', tweet)

    tweet=''.join([i for i in tweet if i not in emoticons])

    tweet=''.join([i for i in tweet if i not in punct])

    tweet=' '.join([i for i in tweet.split() if i not in stopwords])

    tweet=tweet.lower()

    tweet=lemmatize(tweet)

    return tweet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wd.ADJ
    elif treebank_tag.startswith('V'):
        return wd.VERB
    elif treebank_tag.startswith('N'):
        return wd.NOUN
    elif treebank_tag.startswith('R'):
        return wd.ADV
    else:
        return wd.NOUN

def lemmatize(tt):
    pos = nltk.pos_tag(nltk.word_tokenize(tt))
    lemm = [lemmatizer(sw[0], get_wordnet_pos(sw[1])) for sw in pos]
    sentence= ' '.join([i for i in lemm])

    return sentence


test_tweets=[]
file=open('scraped_tweets.csv', 'r')
reader = csv.reader(file)
for line in reader:
    line=line[1]
    test_tweets.append(line)

pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')



proc_train_pos=[]
for tweet in pos_tweets:
    proc_train_pos.append(pre_process(tweet))
proc_train_neg=[]
for tweet in neg_tweets:
    proc_train_neg.append(pre_process(tweet))
proc_test=[]
for tweet in test_tweets:
    proc_test.append(pre_process(tweet))


def bag_of_words(tweet):
    words_dictionary = dict([word, True] for word in tweet.split())    
    return words_dictionary

pos_tweets_set = []
for tweet in proc_train_pos:
    pos_tweets_set.append((bag_of_words(tweet), 'pos'))    

neg_tweets_set = []
for tweet in proc_train_neg:
    neg_tweets_set.append((bag_of_words(tweet), 'neg'))

shuffle(pos_tweets_set)
shuffle(neg_tweets_set)
train_set = pos_tweets_set+neg_tweets_set

classifier = NaiveBayesClassifier(train_set)
print('Training is done')

#print(classifier.classify(bag_of_words('This is magnificent'))) #output 'pos'

print(nltk.classify.accuracy(classifier, proc_set))

Well, as the error message says, the classifier you are trying to use (NaiveBayesClassifier) doesn't have the classify_many method that the nltk.classify.util.accuracy function requires.

(Reference: https://www.nltk.org/_modules/nltk/classify/naivebayes.html)
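For comparison, here is a minimal sketch suggesting that NLTK's own nltk.NaiveBayesClassifier (created through its train() class method) does expose classify_many, so nltk.classify.accuracy can be used with it. The toy feature sets and labels are made up for illustration; the sketch assumes nltk is installed and needs no corpus downloads.

```python
import nltk
from nltk.classify import accuracy

# Toy labelled feature sets (hypothetical data, same shape as bag_of_words output).
train_set = [
    ({'good': True}, 'pos'),
    ({'great': True}, 'pos'),
    ({'bad': True}, 'neg'),
    ({'awful': True}, 'neg'),
]
gold_set = [
    ({'good': True}, 'pos'),
    ({'bad': True}, 'neg'),
]

# NLTK's classifier is built with the train() class method,
# not by calling the constructor on the training data.
nltk_classifier = nltk.NaiveBayesClassifier.train(train_set)

# classify_many is the method nltk.classify.accuracy calls internally.
has_classify_many = hasattr(nltk_classifier, 'classify_many')
nltk_accuracy = accuracy(nltk_classifier, gold_set)
print(has_classify_many, nltk_accuracy)
```

Note that gold_set must contain (features, label) pairs with known labels, which is the same shape as the training data.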

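That may look like an NLTK bug, but the NaiveBayesClassifier imported in your code comes from textblob.classifiers rather than from NLTK, which would explain the missing method. Either way, you can easily compute the accuracy on your own: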

from sklearn.metrics import accuracy_score

y_predicted = [classifier.classify(x) for x in proc_set]

accuracy = accuracy_score(y_true, y_predicted)

Here y_true holds the sentiment labels corresponding to the proc_set inputs (which I don't see you actually creating in the code shown above, though).
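A sketch of how proc_set and y_true could be built, assuming the test tweets come with hand-assigned labels. The labeled_tweets list and the StubClassifier below are hypothetical stand-ins, not part of the question's code. Note that labelling the test tweets with classifier.classify itself, as in the second attempt in the question, makes the predictions and the "truth" identical, so the accuracy would always come out as 1.0.

```python
def bag_of_words(tweet):
    # Same feature format as in the question: {word: True}
    return {word: True for word in tweet.split()}


class StubClassifier:
    """Hypothetical stand-in for the trained classifier; only classify() is assumed."""

    def classify(self, features):
        return 'neg' if 'bad' in features else 'pos'


# Hypothetical hand-labelled test data: (pre-processed tweet, true label)
labeled_tweets = [
    ('really good day', 'pos'),
    ('bad service', 'neg'),
    ('bad luck good friend', 'pos'),
]

classifier = StubClassifier()

# Features and true labels are kept in the same order so they can be zipped later.
proc_set = [bag_of_words(text) for text, _ in labeled_tweets]
y_true = [label for _, label in labeled_tweets]
y_predicted = [classifier.classify(features) for features in proc_set]
```

With real data, classifier would be the trained model and labeled_tweets would come from manually annotated tweets.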

Hope that helps.

EDIT:

Or, without sklearn's accuracy function, in pure Python:

hits = [yp == yt for yp, yt in zip(y_predicted, y_true)]

accuracy = sum(hits)/len(hits)
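The same idea can be wrapped in a small helper that takes (features, label) pairs, much like nltk.classify.accuracy does, but relies only on classify(). This is a sketch rather than a definitive implementation; the function name accuracy_via_classify, the StubClassifier and the toy gold data are assumptions made for illustration.

```python
def accuracy_via_classify(classifier, gold):
    """Fraction of (features, label) pairs in `gold` that the classifier gets right.

    Only classifier.classify() is used, so classify_many is never needed.
    """
    if not gold:
        raise ValueError('gold must contain at least one (features, label) pair')
    hits = sum(classifier.classify(features) == label for features, label in gold)
    return hits / len(gold)


class StubClassifier:
    """Hypothetical stand-in for the trained classifier."""

    def classify(self, features):
        return 'neg' if 'bad' in features else 'pos'


# Toy gold data: three pairs the stub gets right, one it gets wrong.
gold = [
    ({'good': True}, 'pos'),
    ({'bad': True}, 'neg'),
    ({'bad': True, 'good': True}, 'pos'),
    ({'fine': True}, 'pos'),
]

score = accuracy_via_classify(StubClassifier(), gold)
print(score)
```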

Three quick ideas (without rerunning all your code myself, since I don't have your data):

1) I don't see proc_set defined in your code above. Am I missing it, or is that the bug?

2) I've seen the syntax classifier.accuracy(proc_set), so I'd try that just because it's easy. It seems to do the actual classification and the accuracy calculation in one step.

3) If that doesn't work: does classifier.classify(proc_set) work? If so, you have the option to calculate accuracy yourself, which is pretty straightforward.
