Is there any way to classify/remove words (e.g. "which", "potential", "this", "are", etc.) from a list using Python?

I am currently working on a project related to natural language processing and text mining. I have written code to calculate the frequency of unique words in a text file.

Frequencey of:  trypanosomiasis --> 0.0029
Frequencey of:  deadly --> 0.0029
Frequencey of:  yellow --> 0.0029
Frequencey of:  humanassociated --> 0.0029
Frequencey of:  successful --> 0.0029
Frequencey of:  potential --> 0.0058
Frequencey of:  which --> 0.0029
Frequencey of:  cholera --> 0.01449
Frequencey of:  antimicrobial --> 0.0029
Frequencey of:  hostdirected --> 0.0029
Frequencey of:  cameroon --> 0.0029

Is there any library or method that can remove common words, adjectives, helping verbs, etc. (e.g. "which", "potential", "this", "are") from a text file, so that I can explore or calculate the most likely occurrences of scientific terminology in the text data?

Usually in text analysis you remove stopwords: common words that carry little meaning about the text. You can remove these using nltk's stopwords (from https://pythonspot.com/en/nltk-stop-words/ ):

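import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the stopword list and the tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)

# The stopword list is all lowercase, so compare in lowercase;
# otherwise capitalized words such as "All" are not filtered out
wordsFiltered = [w for w in words if w.lower() not in stopWords]

print(wordsFiltered)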

If there are additional words you want to remove, you can just add them to the stopWords set.
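
As a rough sketch of that last point, the snippet below extends a stop list with domain-specific words and then recomputes word frequencies on what is left. To keep it runnable without any downloads, it uses a small hand-written stop list and a plain whitespace split as stand-ins for nltk's stopwords.words('english') and word_tokenize. The function name term_frequencies, the extra words, and the sample sentence are all made up for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list standing in for
# set(stopwords.words('english')) from the answer above.
stop_words = {"which", "this", "are", "is", "a", "the", "of", "for", "and"}

# Domain-specific additions: not standard stopwords, but words that
# carry no scientific meaning in this (assumed) corpus.
stop_words.update({"potential", "successful"})


def term_frequencies(text, stop_words):
    """Return {word: relative frequency} among the words kept after filtering."""
    # Simplified tokenizer: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip('.,;:!?"()').lower() for t in text.split()]
    kept = [t for t in tokens if t and t not in stop_words]
    if not kept:
        return {}
    counts = Counter(kept)
    # Frequencies are relative to the kept words, not to all tokens.
    return {word: count / len(kept) for word, count in counts.items()}


sample = ("Cholera is a potential target for antimicrobial therapy, "
          "which cholera studies support.")
freqs = term_frequencies(sample, stop_words)

# Print the most frequent remaining terms first.
for word, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{word} --> {freq:.4f}")
```

With the real nltk list, the same update() call works on the stopWords set from the answer. Note that a stop list only removes the exact words it contains; it does not identify adjectives or helping verbs as a class.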
