Counting non-stopwords in an NLTK corpus

Question

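Using NLTK in Python, how do I count the non-stopwords in a document filtered by category?

I can figure out how to get the words in a corpus filtered by category. For example, all the words in the Brown corpus for a given category such as 'news' are: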

text = nltk.corpus.brown.words(categories=category)

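I can also figure out how to get all the words in a particular document. For example, all the words in the document 'cj47' in the Brown corpus are: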

text = nltk.corpus.brown.words(fileids='cj47')

I can then iterate over the result and count the words that are not stopwords, e.g.

stopwords = nltk.corpus.stopwords.words('english')
count = 0
for w in text:
    if w.lower() not in stopwords:
        count += 1  # found a non-stopword

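But how do I put these together so that I filter a particular document by category? If I try to specify both a category and a file ID at the same time, e.g.

text = nltk.corpus.brown.words(categories=category, fileids='cj47')

I get an error message:

ValueError: Specify fileids or categories, not both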

Answer 1

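Get the file IDs for the category:

fileids = nltk.corpus.brown.fileids(categories=category)

For each file, count the non-stopwords:

for f in fileids:
    words = nltk.corpus.brown.words(fileids=f)
    # stopwords is the list built in the question; the result is named
    # count so that it does not shadow the built-in sum()
    count = sum(1 for w in words if w.lower() not in stopwords)
    print("Document %s: %d non-stopwords." % (f, count))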
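
To restrict the count to one particular document, as the question asks, one option is to check whether its file ID appears in the category's file list before counting. The following is only a minimal sketch of that idea. The helper names `count_non_stopwords` and `count_in_category` are hypothetical and not part of NLTK, and the `corpus` argument is assumed to be a categorized corpus reader such as `nltk.corpus.brown` that exposes `fileids(categories=...)` and `words(fileids=...)`.

```python
from typing import Iterable, Optional


def count_non_stopwords(words: Iterable[str], stopwords: Iterable[str]) -> int:
    """Count the tokens whose lowercase form is not in the stop word list."""
    # A set makes each membership test constant time instead of a list scan.
    stopword_set = set(stopwords)
    return sum(1 for w in words if w.lower() not in stopword_set)


def count_in_category(corpus, category: str, fileid: str,
                      stopwords: Iterable[str]) -> Optional[int]:
    """Return the non-stopword count for fileid, or None if the file
    does not belong to category.

    Assumption: corpus behaves like nltk.corpus.brown, i.e. it provides
    fileids(categories=...) and words(fileids=...).
    """
    if fileid not in corpus.fileids(categories=category):
        return None
    return count_non_stopwords(corpus.words(fileids=fileid), stopwords)
```

With NLTK and its 'brown' and 'stopwords' data installed, a call such as `count_in_category(nltk.corpus.brown, category, 'cj47', nltk.corpus.stopwords.words('english'))` would be expected to return the count for 'cj47' when that file is in the category, and `None` otherwise. Because the category and the file ID are never passed to `words()` together, the ValueError from the question should not be triggered.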