
Counting non-stop words in an NLTK corpus

In Python, using NLTK, how would I find a count of the number of non-stop words in a document, filtered by category?

I can figure out how to get the words in a corpus filtered by a category, e.g. all the words in the Brown corpus for the category 'news':

text = nltk.corpus.brown.words(categories=category)

And separately, I can figure out how to get all the words for a particular document, e.g. all the words in the document 'cj47' in the Brown corpus:

text = nltk.corpus.brown.words(fileids='cj47')

Then I can loop through the results and count up the words that are not stopwords, e.g.:

stopwords = nltk.corpus.stopwords.words('english')
count = 0
for w in text:
    if w.lower() not in stopwords:
        count += 1  # found a non-stop word

But how do I put it together so that I am filtering by category for a particular document? If I try to specify a category and a fileid at the same time, e.g.:

 text = nltk.corpus.brown.words(categories=category, fileids='cj47')

I get an error saying:

 ValueError: Specify fileids or categories, not both
You cannot pass both at once; instead, do it in two steps:

  1. Get fileids for a category:

    fileids = nltk.corpus.brown.fileids(categories=category)

  2. For each file, count the non-stopwords:

    for f in fileids:
        words = nltk.corpus.brown.words(fileids=f)
        count = sum(1 for w in words if w.lower() not in stopwords)
        print("Document %s: %d non-stopwords." % (f, count))
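Putting the two steps together: a minimal self-contained sketch, assuming the Brown corpus and the English stopword list have already been downloaded (e.g. via nltk.download('brown') and nltk.download('stopwords')), and using 'news' as an example category:

    import nltk

    # Build the stopword list once; a set makes membership tests O(1).
    stopwords = set(nltk.corpus.stopwords.words('english'))

    category = 'news'  # example category; any Brown category works
    for f in nltk.corpus.brown.fileids(categories=category):
        words = nltk.corpus.brown.words(fileids=f)
        count = sum(1 for w in words if w.lower() not in stopwords)
        print("Document %s: %d non-stopwords." % (f, count))

The set is just an optional speed-up; the plain list returned by stopwords.words('english') gives the same counts, only with slower lookups.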
