How to remove stop words using NLTK
I'm having problems with removing stop words using NLTK. I'm using the following code, which works fine without the part where I try to remove stop words.
from nltk.probability import FreqDist
from nltk.corpus import stopwords
text = open(r"C:\Users\meris\OneDrive\Dokumente\example.txt", encoding='utf-8').read()
token = word_tokenize(text)
clean_tokens = token[:]
sr = stopwords.words('and')
for token in token:
    if token in stopwords.words('and'):
        clean_tokens.remove(token)
for key, val in clean_tokens.items():
    print(str(key) + ':' + str(val))
This is the error message I receive every time:
File "C:/Users/meris/PycharmProjects/pythonProject/main.py", line 22, in <module>
sr = stopwords.words('and')
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 23, in words
for line in line_tokenize(self.raw(fileids))
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 32, in raw
return concat([self.open(f).read() for f in fileids])
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 32, in <listcomp>
return concat([self.open(f).read() for f in fileids])
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\api.py", line 208, in open
stream = self._root.join(file).open(encoding)
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\data.py", line 337, in join
return FileSystemPathPointer(_path)
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\data.py", line 315, in __init__
raise IOError("No such file or directory: %r" % _path)
OSError: No such file or directory: 'C:\\Users\\meris\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\and'
Does anyone have an idea how I can solve this?
You have to download and import the stop words for the language you want; here I used English as a reference. `stopwords.words()` takes a language name, not a word, which is why `'and'` makes NLTK look for a corpus file called `and` that doesn't exist. After filtering, `filtered_sentence` will contain the sentence without any of the stop words.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in token if w not in stop_words]
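The filtering step itself is plain Python, and the question's final loop (`for key, val in clean_tokens.items()`) suggests the goal was token frequencies, which a list doesn't provide. A minimal sketch of both ideas, using a small hand-written stop-word set so it runs without downloading any NLTK data (in real use, `stop_words` would come from `set(stopwords.words('english'))`):

```python
from collections import Counter

# Hand-written stop-word set for illustration only; with NLTK this
# would be set(stopwords.words('english')).
stop_words = {"the", "and", "in", "a"}

tokens = ["the", "cat", "and", "the", "dog", "sleep",
          "in", "a", "box", "the", "dog"]

# Keep every token that is not a stop word.
clean_tokens = [w for w in tokens if w not in stop_words]

# clean_tokens is a list, so it has no .items(); to print key:value
# pairs as in the question, count the tokens first.
freq = Counter(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
```

This also explains the second error the question's code would hit: `.items()` exists on dict-like objects such as `Counter` (or NLTK's `FreqDist`), not on lists.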