
How to remove stop words using NLTK

I'm having problems removing stop words using NLTK.

I'm using the following code, which works if I leave out the part where I try to remove stop words.

from nltk.probability import FreqDist
from nltk.corpus import stopwords



text = open(r"C:\Users\meris\OneDrive\Dokumente\example.txt",encoding='utf-8').read()
token = word_tokenize(text)


clean_tokens = token[:]

sr = stopwords.words('and')

for token in token:

    if token in stopwords.words('and'):

        clean_tokens.remove(token)



for key,val in clean_tokens.items():

    print (str(key) + ':' + str(val))


This is the error message I receive every time:

  File "C:/Users/meris/PycharmProjects/pythonProject/main.py", line 22, in <module>
    sr = stopwords.words('and')
  File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 23, in words
    for line in line_tokenize(self.raw(fileids))
  File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 32, in raw
    return concat([self.open(f).read() for f in fileids])
  File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 32, in <listcomp>
    return concat([self.open(f).read() for f in fileids])
  File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\api.py", line 208, in open
    stream = self._root.join(file).open(encoding)
  File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\data.py", line 337, in join
    return FileSystemPathPointer(_path)
  File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\data.py", line 315, in __init__
    raise IOError("No such file or directory: %r" % _path)
OSError: No such file or directory: 'C:\\Users\\meris\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\and'




Does anyone have an idea how I could solve this?

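You have to download the stop word lists and load the one for the language you want; here I used English as a reference. The argument to stopwords.words() is a language name, not a word: stopwords.words('and') makes NLTK look for a corpus file named 'and', which is why the traceback ends with "No such file or directory".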

filtered_sentence will contain the tokens of the sentence with all the stop words removed.

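import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch the stop word lists (only needed once)
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
filtered_sentence = [w for w in token if w not in stop_words]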
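For completeness, here is a minimal end-to-end sketch of the same idea, including the counting loop from the question. The helper names load_nltk_stop_words and remove_stop_words, the sample tokens, and the tiny hand-written stop word set are my own assumptions for illustration, not part of NLTK. I also lowercase each token before the lookup, on the assumption that you want "The" filtered as well, since NLTK's English list is lowercase. Note that the loop in the question calls .items() on a list, which would fail; counting the tokens first (here with collections.Counter, which FreqDist subclasses) gives you something to iterate over.

```python
from collections import Counter


def load_nltk_stop_words(language="english"):
    """Return NLTK's stop word list for `language` as a set.

    NLTK is imported lazily so the rest of this sketch runs without it.
    """
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # fetches the corpus if it is missing
    return set(stopwords.words(language))


def remove_stop_words(tokens, stop_words):
    """Keep tokens whose lowercase form is not in `stop_words`."""
    return [t for t in tokens if t.lower() not in stop_words]


# Tiny hand-written stop word set, for illustration only.
# With NLTK installed, use: demo_stop_words = load_nltk_stop_words("english")
demo_stop_words = {"the", "and", "is", "a"}

tokens = ["The", "cat", "and", "the", "dog", "is", "a", "cat"]
clean_tokens = remove_stop_words(tokens, demo_stop_words)

# Count the remaining tokens; nltk.probability.FreqDist offers the same
# dict-style .items() interface.
for key, val in Counter(clean_tokens).items():
    print(str(key) + ":" + str(val))
```

Building a new list this way also avoids calling remove() on a list inside a loop, which only deletes the first matching occurrence each time and is slow on long texts.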
