
Filtering out stopwords

I've created a simple word count program and I'm trying to filter out commonly used words from my list using NLTK (see below).

My question is: how would I apply my "stop" filter to my "frequency" list?

#Start 
from nltk.corpus import stopwords
import re
import string
frequency = {}
document_text = open('Import.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

for word in match_pattern:
    count = frequency.get(word,0)
    frequency[word] = count + 1

frequency = {k:v for k,v in frequency.items() if v>1}

stop = set(stopwords.words('english'))
stop = list(stop)
stop.append(".")

import csv

with open('Export.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for key, value in frequency.items():
        writer.writerow([key, value])
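# Build the stopword set and add "." to it (sets use add(), not append())
stop = set(stopwords.words('english'))
stop.add(".")

# Keep words that occur more than once and are not stopwords
frequency = {k:v for k,v in frequency.items() if v>1 and k not in stop}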

While stop is still a set, check the keys of your frequency dictionary against it in the comprehension. You can still turn stop back into a list afterwards.
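As a self-contained illustration of that comprehension, here is a minimal sketch that runs without the NLTK corpus download. The sample text and the tiny stop set are stand-ins made up for this example; in the real program stop would come from stopwords.words('english') and the text from Import.txt. Counter is used here only as a shorthand for the counting loop in the question.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the contents of Import.txt
text_string = "The cat and the dog saw the cat. The dog ran and the cat ran.".lower()

# Hypothetical, tiny stand-in for set(stopwords.words('english'))
stop = {"the", "and", "a", "of"}
stop.add(".")

# Same pattern as in the question: words of 3 to 15 lowercase letters
words = re.findall(r'\b[a-z]{3,15}\b', text_string)
frequency = Counter(words)

# Keep words seen more than once that are not stopwords
filtered = {k: v for k, v in frequency.items() if v > 1 and k not in stop}
print(filtered)
```

Here "the" and "and" both occur more than once but are dropped because they are in stop, and "saw" is dropped because it occurs only once.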

I keep it as a set because membership tests on a set are much faster than on a list: a set lookup is a hash lookup, while a list has to be scanned element by element.
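To make the efficiency point concrete, here is a rough timing sketch. The 1,000 made-up tokens are an assumption for illustration, not the NLTK stopword list, and the exact timings will vary by machine, so treat the numbers as indicative only.

```python
import timeit

# Hypothetical stand-in for a stopword collection: 1,000 made-up tokens
stop_list = [f"word{i}" for i in range(1000)]
stop_set = set(stop_list)

# Worst case for the list: the probe is the last element
probe = "word999"

# Time 10,000 membership tests against each container
list_time = timeit.timeit(lambda: probe in stop_list, number=10_000)
set_time = timeit.timeit(lambda: probe in stop_set, number=10_000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Since the comprehension runs one membership test per key in frequency, the gap grows with the size of the document being counted.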
