How to remove stop words using NLTK
I'm having problems with removing stop words using NLTK. I'm using the following code, which works fine without the part where I try to remove stop words.
from nltk.probability import FreqDist
from nltk.corpus import stopwords
text = open(r"C:\Users\meris\OneDrive\Dokumente\example.txt", encoding='utf-8').read()
token = word_tokenize(text)
clean_tokens = token[:]
sr = stopwords.words('and')
for token in token:
    if token in stopwords.words('and'):
        clean_tokens.remove(token)
for key, val in clean_tokens.items():
    print(str(key) + ':' + str(val))
This is the error message I receive every time:
File "C:/Users/meris/PycharmProjects/pythonProject/main.py", line 22, in <module>
sr = stopwords.words('and')
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 23, in words
for line in line_tokenize(self.raw(fileids))
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 32, in raw
return concat([self.open(f).read() for f in fileids])
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\wordlist.py", line 32, in <listcomp>
return concat([self.open(f).read() for f in fileids])
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\corpus\reader\api.py", line 208, in open
stream = self._root.join(file).open(encoding)
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\data.py", line 337, in join
return FileSystemPathPointer(_path)
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\compat.py", line 41, in _decorator
return init_func(*args, **kwargs)
File "C:\Users\meris\PycharmProjects\pythonProject\venv\lib\site-packages\nltk\data.py", line 315, in __init__
raise IOError("No such file or directory: %r" % _path)
OSError: No such file or directory: 'C:\\Users\\meris\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\and'
Does anyone have an idea how I can solve this?
You have to download and import the stop words for the language you want; here I used English as a reference. `stopwords.words()` takes a language name, not a word, which is why `'and'` makes NLTK look for a corpus file called `and` that doesn't exist. After filtering, `filtered_sentence` will contain the sentence without any of the stop words.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in token if w not in stop_words]
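The filtering step itself is plain Python, and the question's final loop (`for key, val in clean_tokens.items()`) suggests the goal was token frequencies, which a list doesn't provide. A minimal sketch of both ideas, using a small hand-written stop-word set so it runs without downloading any NLTK data (in real use, `stop_words` would come from `set(stopwords.words('english'))`):

```python
from collections import Counter

# Hand-written stop-word set for illustration only; with NLTK this
# would be set(stopwords.words('english')).
stop_words = {"the", "and", "in", "a"}

tokens = ["the", "cat", "and", "the", "dog", "sleep",
          "in", "a", "box", "the", "dog"]

# Keep every token that is not a stop word.
clean_tokens = [w for w in tokens if w not in stop_words]

# clean_tokens is a list, so it has no .items(); to print key:value
# pairs as in the question, count the tokens first.
freq = Counter(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
```

This also explains the second error the question's code would hit: `.items()` exists on dict-like objects such as `Counter` (or NLTK's `FreqDist`), not on lists.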