
Counting non-stop words in an NLTK corpus

In Python, using NLTK, how would I find a count of the number of non-stop words in a document, filtered by category?

I can figure out how to get the words in a corpus filtered by a category, e.g. all the words in the Brown corpus for the category 'news':

text = nltk.corpus.brown.words(categories=category)

And separately, I can figure out how to get all the words for a particular document, e.g. all the words in the document 'cj47' in the Brown corpus:

text = nltk.corpus.brown.words(fileids='cj47')

Then I can loop through the results and count up the words that are not stopwords, e.g.:

stopwords = nltk.corpus.stopwords.words('english')
count = 0
for w in text:
    if w.lower() not in stopwords:
        count += 1  # found a non-stop word

But how do I put it together so that I am filtering by category for a particular document? If I try to specify a category and a fileid at the same time, e.g.:

 text = nltk.corpus.brown.words(categories=category, fileids='cj47')

I get an error saying:

 ValueError: Specify fileids or categories, not both
You cannot pass both at once; instead, do it in two steps:

  1. Get fileids for a category:

    fileids = nltk.corpus.brown.fileids(categories=category)

  2. For each file, count the non-stopwords:

    for f in fileids:
        words = nltk.corpus.brown.words(fileids=f)
        count = sum(1 for w in words if w.lower() not in stopwords)
        print("Document %s: %d non-stopwords." % (f, count))
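Putting the two steps together: a minimal self-contained sketch, assuming the Brown corpus and the English stopword list have already been downloaded (e.g. via nltk.download('brown') and nltk.download('stopwords')), and using 'news' as an example category:

    import nltk

    # Build the stopword list once; a set makes membership tests O(1).
    stopwords = set(nltk.corpus.stopwords.words('english'))

    category = 'news'  # example category; any Brown category works
    for f in nltk.corpus.brown.fileids(categories=category):
        words = nltk.corpus.brown.words(fileids=f)
        count = sum(1 for w in words if w.lower() not in stopwords)
        print("Document %s: %d non-stopwords." % (f, count))

The set is just an optional speed-up; the plain list returned by stopwords.words('english') gives the same counts, only with slower lookups.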
