Filtering stop words out of a large text file (using package: nltk.corpus)

I'm trying to rank the most frequently used words in a large text file, Alice in Wonderland (which is in the public domain). Here is Alice in Wonderland on Dropbox and on Pastebin. It runs, and as expected there are 1818 instances of "the" and 940 instances of "and".

But now, in my latest iteration of the script, I'm trying to filter out the most commonly used words such as "and", "there", "the", "that", "to", "a", etc. Any search algorithm out there looks for words like these (called stop words in SEO terminology) and excludes them from the query. The Python library I've imported for this task is nltk.corpus.

When I generate a stop words list and invoke the filter, all the instances of "the" and "of" are filtered out as expected, but it's not catching "and" or "you". It's not clear to me why.

I've tried reinforcing the stop words list by manually and explicitly adding words which appear in the output but shouldn't be there. I've added "said", "you", "that", and others, yet they still appear among the top 10 most common words in the text file.

Here is my script:

from collections import Counter
from nltk.corpus import stopwords
import re

def open_file():
    with open('Alice.txt') as f:
        text = f.read().lower()
    return text

def main(text):
    stoplist = stopwords.words('english') # Bring in the default English NLTK stop words
    stoplist.extend(["said", "i", "it", "you", "and", "that"])
    # print(stoplist)
    clean = [word for word in text.split() if word not in stoplist]
    clean_text = ' '.join(clean)
    words = re.findall(r'\w+', clean_text)
    top_10 = Counter(words).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4}')

if __name__ == "__main__":
    text = open_file()
    main(text)

Here is my actual output:

$ python script8.py

'alice' -->   403
'i'  -->   283
'it' -->   205
's'  -->   184
'little' -->   128
'you' -->   115
'and' -->   107
'one' -->   106
'gutenberg' -->    93
'that' -->    92

What I am expecting is for all the instances of "i", "it" and "you" to be excluded from this list, but they are still appearing and it is not clear to me why.

Your code does this:

  1. First you split the text on whitespace using text.split(). But the resulting list of 'words' still includes punctuation attached to the words, such as "as,", "head!'" and "'i" (note that ' is used as a quotation mark as well as an apostrophe).

  2. Then you exclude any 'words' that have a match in stopwords. This will exclude "i" but not "'i".

  3. Next you re-join all the remaining words using spaces.

  4. Then you use a '\w+' regex to search for sequences of letters (NOT including punctuation): so "'i" will match as "i". That's why "i" and "s" are showing up in your top 10. The short sketch after this list walks through these steps on a small example.
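To see this concretely, here is a minimal sketch (not from the original answer; the sample sentence is made up for illustration) that reproduces the four steps on a short string:

from collections import Counter
from nltk.corpus import stopwords
import re

sample = "'i wonder,' said alice, 'if i fell.'"  # invented example sentence
stoplist = stopwords.words('english')

# Steps 1-2: split on whitespace and filter; "'i" and "'if" slip through
# because the attached quote keeps them from matching the stop list
clean = [word for word in sample.split() if word not in stoplist]

# Steps 3-4: rejoin and extract \w+ sequences; the quotes are stripped,
# so "i" and "if" reappear in the counts even though they are stop words
words = re.findall(r'\w+', ' '.join(clean))
print(Counter(words).most_common())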

There are a couple of ways to fix this. For example, you can use re.split() to split on more than just whitespace:

def main(text):
    stoplist = stopwords.words('english')
    stoplist.extend(["said"]) # stoplist already includes "i", "it", "you"
    clean = [word for word in re.split(r"\W+", text) if word not in stoplist]
    top_10 = Counter(clean).most_common(10)
    for word, count in top_10:
        print(f'{word!r:<4} {"-->":^4} {count:>4}')

Output:

'alice' -->   403
'little' -->   128
'one' -->   106
'gutenberg' -->    93
'know' -->    88
'project' -->    87
'like' -->    85
'would' -->    83
'went' -->    83
'could' -->    78

Note that this treats hyphenated phrases as separate words: so gutenberg-tm -> gutenberg, tm. For more control over this, you could follow Jay's suggestion and look at nltk.tokenize. For example, the nltk tokenizer is aware of contractions, so don't -> do + n't.
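If you want to try that, a rough sketch might look like the following (this is an illustration rather than code from the answer, and word_tokenize needs the NLTK punkt tokenizer data, which you may have to download first with nltk.download('punkt')):

from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('Alice.txt') as f:
    text = f.read().lower()

stoplist = set(stopwords.words('english'))
tokens = word_tokenize(text)

# Drop stop words and pure-punctuation tokens. Hyphenated words such as
# "gutenberg-tm" stay together as one token, and contractions are split,
# so fragments like "n't" may be worth adding to the stop list as well.
clean = [t for t in tokens if t not in stoplist and any(c.isalpha() for c in t)]
print(Counter(clean).most_common(10))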

You could also improve things by removing the Gutenberg licensing conditions from your text :)
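For instance, here is a minimal sketch of that idea. It assumes your copy of the file still contains the standard Project Gutenberg "*** START OF ..." and "*** END OF ..." marker lines (the exact wording varies between releases, so check your copy), and it matches them in lower case because open_file() already lower-cases the text:

def strip_gutenberg_license(text):
    # Keep only the body between the START and END marker lines
    start = text.find('*** start of')
    end = text.find('*** end of')
    if start == -1 or end == -1:
        return text  # markers not found: leave the text unchanged
    start = text.find('\n', start) + 1  # skip the rest of the START marker line
    return text[start:end]

You could then call main(strip_gutenberg_license(text)) instead of main(text).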

For example:

"it's".split() >> [it's] "it's".split() >> [它是]

re.findall('\\w+', "it's") >> [it, s] re.findall('\\w+', "it's") >> [它,s]

that is why "stoplist" won't be like you think. 这就是为什么“停止列表”不会像你想的那样。

Fix:

def main(text):
    words = re.findall(r'\w+', text)
    counter = Counter(words)
    stoplist = stopwords.words('english')
    #stoplist.extend(["said", "i", "it", "you", "and", "that", ])
    stoplist.extend(["said", "i", "it", "you"])
    # keep "s", "and" and "that" in the counts by taking them off the stop list
    for keep_word in ['s', 'and', 'that']:
        stoplist.remove(keep_word)
    # delete every remaining stop word from the counter after counting
    for stop_word in stoplist:
        del counter[stop_word]
    for word, count in counter.most_common(10):
        print(f'{word!r:<4} {"-->":^4} {count:>4}')

Output:

'and' -->   940
'alice' -->   403
'that' -->   330
's'  -->   219
'little' -->   128
'one' -->   106
'gutenberg' -->    93
'know' -->    88
'project' -->    86
'like' -->    85

Note: "i", "it" and "you" are excluded from your list.
