
Train corpus of Tweets for Sentiment Analysis, using NLTK for Python

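I am trying to train my own corpus for sentiment analysis using NLTK for Python. I have two text files: one with 25K positive tweets, one tweet per line, and another with 25K negative tweets.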

I am following method 2 from this Stack Overflow post.

When I run the following code to create the corpus:

import string
from itertools import chain

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk

mydir = 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'

mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

I get this error message:

C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py"
Traceback (most recent call last):
  File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py", line 23, in <module>
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
    assert self._len is not None
AssertionError

Process finished with exit code 1

Does anyone know how to fix this?

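I'm not 100% positive, since I'm not on a Windows machine to test this at the moment, but I think what may be tripping you up is the difference in path slash direction between @alvas's original example and your adaptation of it for Windows.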

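Specifically, you use 'C:\\Users\\gerbuiker\\Desktop\\Sentiment Analyse\\my_movie_reviews', whereas his example uses '/home/alvas/my_movie_reviews'. For the most part that is fine, but you are also reusing his cat_pattern regular expression, r'(neg|pos)/.*', which will match the forward slashes in his paths but reject the backslashes in yours.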
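To make the suspected mismatch concrete, here is a minimal sketch that uses only Python's `re` module. The file IDs `pos/tweets.txt` and `pos\tweets.txt` and the helper `category_of` are hypothetical, made up for illustration. The sketch also assumes the pattern is applied from the start of the file ID. It shows only how the regex itself behaves; whether NLTK actually reports backslash-separated file IDs on Windows is not verified here, so it is worth printing `mr.fileids()` and `mr.categories()` to see what the reader really finds.

```python
import re

# The category pattern from the question, reused from the original example.
cat_pattern = r'(neg|pos)/.*'

# Hypothetical file IDs, for illustration only (not taken from the question).
posix_style = 'pos/tweets.txt'
windows_style = 'pos\\tweets.txt'


def category_of(fileid, pattern):
    """Return the captured category, or None when the pattern does not match."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None


# The forward-slash ID matches; the backslash ID does not.
print(category_of(posix_style, cat_pattern))
print(category_of(windows_style, cat_pattern))

# A separator-tolerant variant that accepts either slash direction.
tolerant_pattern = r'(neg|pos)[\\/].*'
print(category_of(windows_style, tolerant_pattern))
print(category_of(posix_style, tolerant_pattern))
```

If the printed file IDs do turn out to use backslashes, passing a pattern like `tolerant_pattern` as `cat_pattern` would be one way to try the fix. If they already use forward slashes, the cause lies elsewhere, for example in the directory layout (no `neg`/`pos` subfolders) or the file encoding.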
