NLTK stopword removal issue

Question

I am trying to do document classification as described in NLTK Chapter 6, but I am having trouble removing stopwords. When I add

all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))

it returns

Traceback (most recent call last):
  File "fiction.py", line 8, in <module>
    word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'

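I am guessing that the stopword code changed the type of the object bound to 'all_words', so that its .keys() method is no longer available. How can I remove the stopwords before calling keys() without changing the object's type? Full code below: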

import nltk 
from nltk.corpus import PlaintextCorpusReader

corpus_root = './nltk_data/corpora/fiction'
fiction = PlaintextCorpusReader(corpus_root, '.*')
all_words=nltk.FreqDist(w.lower() for w in fiction.words())
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
word_features = all_words.keys()[:100]

def document_features(document): # [_document-classify-extractor]
    document_words = set(document) # [_document-classify-set]
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(fiction.words('fic/11.txt'))

Answer 1

I would do this by not adding them to the FreqDist instance in the first place:

all_words=nltk.FreqDist(w.lower() for w in fiction.words() if w.lower() not in nltk.corpus.stopwords.words('english'))
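
For context, the traceback in the question arises because a parenthesized comprehension builds a generator, which is not dict-like, whereas filtering inside the constructor call keeps the counting type. The following is a minimal Python 3 sketch of that difference. It assumes collections.Counter as a stand-in for nltk.FreqDist, and a tiny hypothetical token list and stopword list in place of fiction.words() and nltk.corpus.stopwords.words('english'), so that it runs without NLTK or its corpora.

```python
from collections import Counter

# Hypothetical stand-ins (assumptions, not from the question):
# a tiny token list instead of fiction.words(), and a short list
# instead of nltk.corpus.stopwords.words('english').
tokens = ["The", "cat", "and", "the", "hat", "cat"]
stopwords = ["the", "and"]

counts = Counter(w.lower() for w in tokens)

# A parenthesized comprehension yields a generator, which has no .keys().
filtered_gen = (w for w in counts if w not in stopwords)
gen_has_keys = hasattr(filtered_gen, "keys")

# Filtering inside the constructor call keeps the dict-like counting type.
filtered_counts = Counter(
    w.lower() for w in tokens if w.lower() not in stopwords
)

print(type(filtered_gen).__name__, gen_has_keys)
print(type(filtered_counts).__name__, hasattr(filtered_counts, "keys"))
```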

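Depending on the size of your corpus, I think you would probably get a performance boost by building a set of the stopwords before constructing the FreqDist, and testing membership against that set: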

stopword_set = frozenset(nltk.corpus.stopwords.words('english'))
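
Putting the two suggestions together might look like the Python 3 sketch below. As assumptions for illustration, collections.Counter stands in for nltk.FreqDist, and the token list and three-word stopword set are hypothetical. The set is built once, so each membership test is a hash lookup rather than a fresh call to stopwords.words('english') for every token. The sketch also uses most_common() to pick the top words, since on Python 3 keys() returns a view that cannot be sliced the way the question's keys()[:100] does.

```python
from collections import Counter

# Hypothetical tokens standing in for fiction.words() (an assumption).
tokens = ["The", "whale", "and", "the", "sea", "of", "whale", "whale", "sea"]

# Built once up front; a hypothetical short list stands in for
# nltk.corpus.stopwords.words('english').
stopword_set = frozenset(["the", "and", "of"])

# Counter stands in for nltk.FreqDist; stopwords never enter the counts.
all_words = Counter(
    w.lower() for w in tokens if w.lower() not in stopword_set
)

# Keep the N most frequent words as features (N=2 here, 100 in the question).
word_features = [word for word, _count in all_words.most_common(2)]
print(word_features)
```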

If that is not appropriate for your situation, it looks like you can take advantage of the fact that FreqDist inherits from dict:

for stopword in nltk.corpus.stopwords.words('english'):
    if stopword in all_words:
        del all_words[stopword]
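
The same in-place removal can be tried without NLTK installed. The sketch below is Python 3 and assumes collections.Counter as a stand-in for FreqDist, with hypothetical sample words and stopwords. It uses pop() with a default, which folds the membership check and the deletion into a single call.

```python
from collections import Counter

# Hypothetical counts standing in for the FreqDist (an assumption).
all_words = Counter(["the", "cat", "and", "the", "hat", "cat"])

# Hypothetical short list in place of nltk.corpus.stopwords.words('english').
stopwords = ["the", "and", "of"]

for stopword in stopwords:
    # pop() with a default removes the key if present and ignores absent
    # keys such as "of", so no separate "in" check is needed.
    all_words.pop(stopword, None)

print(sorted(all_words.items()))
```

The object keeps its type throughout, so keys() and the other dict methods remain available afterwards.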

Answer 1 (score 4, accepted, 2013-12-23 01:04:46)
