
Remove a list of stopwords from a Counter in Python

I've got a function in NLTK to generate a concordance list, which would look like

concordanceList = ["this is a concordance string something", 
               "this is another concordance string blah"] 

and I have another function which returns a Counter dictionary with the counts of each word in the concordanceList:

def mostCommonWords(concordanceList):
  finalCount = Counter()
  for line in concordanceList:
    words = line.split(" ")
    currentCount = Counter(words)
    finalCount.update(currentCount)
  return finalCount

The problem I have is how best to remove stopwords from the resulting Counter, so that when I call

mostCommonWords(concordanceList).most_common(10)

the result isn't just [('the', 100), ('is', 78), ('that', 57)].

I think that pre-processing the text to remove stopwords is out, because I still need the concordance strings to be instances of grammatical language. Basically, I'm asking if there's a simpler way to do this than creating a Counter for the stopwords, setting the values low, and then making yet another Counter like so:

stopWordCounter = Counter(the=1, that=1, so=1, and=1)  # note: and=1 is a SyntaxError, since "and" is a Python keyword
processedWordCounter = mostCommonWords(concordanceList) & stopWordCounter

which should set the count values for all stopwords to 1, but it seems hacky.

Edit: Additionally, I'm having trouble actually making such a stopWordCounter, because if I want to include reserved words like "and", I get an invalid syntax error. Counters have easy-to-use union and intersection methods, which would make the task fairly simple; are there equivalent methods for dictionaries?
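
As a sketch of one way around both of those problems (names here match the question's code): since "and" is a Python keyword, the Counter can be built from a dict literal instead of keyword arguments, and a dict comprehension serves as the dictionary equivalent of the set-style filtering:

from collections import Counter

stopWordCounter = Counter({"the": 1, "that": 1, "so": 1, "and": 1})  # dict literal avoids the keyword clash
wordCounts = mostCommonWords(concordanceList)
# keep only the keys that are not stopwords
processedWordCounter = Counter({w: c for w, c in wordCounts.items() if w not in stopWordCounter})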

You can remove the stop words during tokenization...

stop_words = frozenset(['the', 'a', 'is'])
def mostCommonWords(concordanceList):
    finalCount = Counter()
    for line in concordanceList:
        words = [w for w in line.split(" ") if w not in stop_words]
        finalCount.update(words)  # update final count using the words list
    return finalCount
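
Called on the question's example data, this would give something like (tie order among equal counts may vary):

concordanceList = ["this is a concordance string something",
                   "this is another concordance string blah"]
print(mostCommonWords(concordanceList).most_common(3))
# [('this', 2), ('concordance', 2), ('string', 2)]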

First, you don't need to create all those new Counters inside your function; you can do:

for line in concordanceList:
    finalCount.update(line.split(" "))

instead.

Second, a Counter is a kind of dictionary, so you can delete items directly:

for sword in stopwords:
    del yourCounter[sword]

It doesn't matter whether sword is in the Counter; this won't raise an exception regardless, because Counter overrides __delitem__ to ignore missing keys, unlike a plain dict.
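
Putting both suggestions together, a minimal sketch (with an arbitrary stopword list):

from collections import Counter

stopwords = ['the', 'that', 'so', 'and']

def mostCommonWords(concordanceList):
    finalCount = Counter()
    for line in concordanceList:
        finalCount.update(line.split(" "))  # update straight from the word list
    for sword in stopwords:
        del finalCount[sword]               # safe even if sword was never counted
    return finalCount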

I'd go for flattening the items into words, ignoring any stop words, and providing that as input to a single Counter instead:

from collections import Counter
from itertools import chain

lines = [
    "this is a concordance string something", 
    "this is another concordance string blah"
]

stops = {'this', 'that', 'a', 'is'}    
words = chain.from_iterable(line.split() for line in lines)
count = Counter(word for word in words if word not in stops)

Or, that last bit can be done as:

from itertools import ifilterfalse  # Python 2; in Python 3 this is itertools.filterfalse
count = Counter(ifilterfalse(stops.__contains__, words))
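
In Python 3 the same idea would be (rebuilding words first, since the generator above is consumed once):

from itertools import filterfalse

words = chain.from_iterable(line.split() for line in lines)
count = Counter(filterfalse(stops.__contains__, words))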

You have a couple of options.

One, don't count the stopwords when updating your Counter - which you can do more concisely, since Counter objects can accept an iterable as well as another mapping for update:

def mostCommonWords(concordanceList):
    finalCount = Counter()
    stopwords = frozenset(['the', 'that', 'so'])
    for line in concordanceList:
        words = line.strip().split(' ')
        finalCount.update([word for word in words if word not in stopwords])
    return finalCount

Alternatively, you can use del to actually remove them from the Counter once you're done.

I've also added a strip call on line prior to split. If you were to use split() and the default behavior of splitting on all whitespace, you wouldn't need that, but split(' ') will not consider a newline to be something to split on, so the last word of each line would have a trailing \n and would be considered distinct from any other appearances. strip gets rid of that.
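
A quick illustration of the difference:

line = "this is a concordance string something\n"
print(line.split(' '))          # [..., 'something\n'] - the newline sticks to the last word
print(line.split())             # [..., 'something']   - default split consumes all whitespace
print(line.strip().split(' '))  # [..., 'something']   - strip first, then split on spaces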

How about:

if 'the' in counter:
    del counter['the']

Personally, I think @JonClements' answer was the most elegant. BTW, there is already a list of stopwords in NLTK, just in case the OP didn't know; see NLTK stopword removal issue:

from collections import Counter
from itertools import chain, ifilterfalse  # Python 2; use filterfalse in Python 3
from nltk.corpus import stopwords

lines = [
    "this is a concordance string something", 
    "this is another concordance string blah"
]

stops = set(stopwords.words('english'))  # a set makes the membership tests O(1)
words = chain.from_iterable(line.split() for line in lines)
count = Counter(word for word in words if word not in stops)
# or, equivalently (words is a one-shot generator, so rebuild it before reusing):
# count = Counter(ifilterfalse(stops.__contains__, words))

Also, the FreqDist module in NLTK has more NLP-related features than collections.Counter: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html
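
A minimal sketch, assuming the NLTK 3 API (where FreqDist subclasses Counter, so most_common works the same way) and reusing lines from the snippet above:

from itertools import chain
from nltk import FreqDist
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
words = chain.from_iterable(line.split() for line in lines)
fdist = FreqDist(word for word in words if word not in stops)
print(fdist.most_common(10))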
