Is there any way to classify/remove words (e.g. "which", "potential", "this", "are", etc.) from a list using Python?
I am currently working on a project related to natural language processing and text mining. I have written code to calculate the frequency of unique words in a text file.
Frequency of: trypanosomiasis --> 0.0029
Frequency of: deadly --> 0.0029
Frequency of: yellow --> 0.0029
Frequency of: humanassociated --> 0.0029
Frequency of: successful --> 0.0029
Frequency of: potential --> 0.0058
Frequency of: which --> 0.0029
Frequency of: cholera --> 0.01449
Frequency of: antimicrobial --> 0.0029
Frequency of: hostdirected --> 0.0029
Frequency of: cameroon --> 0.0029
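The question does not include the frequency code itself, so here is a minimal sketch of how relative frequencies like the ones above might be computed. The tokenization (lowercasing and keeping only alphabetic runs) is an assumption, not the asker's actual method:

```python
from collections import Counter
import re

def word_frequencies(text):
    # Simplifying assumption: lowercase and keep only alphabetic tokens
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Relative frequency of each unique word, rounded to 4 places
    return {w: round(c / total, 4) for w, c in counts.items()}

freqs = word_frequencies("cholera is deadly and cholera spreads")
print(freqs["cholera"])  # 2 occurrences out of 6 tokens
```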
Is there any library or method that can remove common words, adjectives, helping verbs, etc. (e.g. "which", "potential", "this", "are", etc.) from a text file, so that I can explore or calculate the most likely occurrences of scientific terminology in the text data?
Usually in text analysis you remove stopwords - common words that hold little meaning about the text. You can remove these using NLTK's stopwords (from https://pythonspot.com/en/nltk-stop-words/ ):
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."

stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []

for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)
If there are additional words you want to remove, you can simply add them to the set stopWords.
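For instance, extending the stop word set with domain-specific terms works like any other set update. In this sketch a plain set stands in for stopwords.words('english') so it runs without downloading NLTK data; the example words are illustrative:

```python
# Stand-in for stopwords.words('english'); in practice you would
# start from the NLTK list and extend it.
stopWords = {'which', 'this', 'are', 'and', 'no', 'all', 'a'}
stopWords.update({'potential', 'deadly'})  # custom, domain-specific additions

words = ['cholera', 'which', 'is', 'potential', 'antimicrobial']
wordsFiltered = [w for w in words if w not in stopWords]
print(wordsFiltered)  # ['cholera', 'is', 'antimicrobial']
```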