Is there a way to use Python to classify/remove words from a list (e.g., "which", "potential", "this", "is", etc.)?

Question

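I am currently working on a project related to natural language processing and text mining, and I have written code that calculates the frequency of each unique word in a text file.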

Frequency of:  trypanosomiasis --> 0.0029
Frequency of:  deadly --> 0.0029
Frequency of:  yellow --> 0.0029
Frequency of:  humanassociated --> 0.0029
Frequency of:  successful --> 0.0029
Frequency of:  potential --> 0.0058
Frequency of:  which --> 0.0029
Frequency of:  cholera --> 0.01449
Frequency of:  antimicrobial --> 0.0029
Frequency of:  hostdirected --> 0.0029
Frequency of:  cameroon --> 0.0029

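Is there any library or method for removing common words, adjectives, helping verbs, and so on (e.g., "which", "potential", "this", "is") from a text file, so that I can explore or count the scientific terms most likely to occur in the text data?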

Answer 1

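Typically in text analysis you remove stop words: common words that contribute little to the meaning of the text. You can remove them using nltk's stop word list (example from https://pythonspot.com/en/nltk-stop-words/):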

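from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK tokenizer and stop word data (fetch once with nltk.download)
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))  # English stop words, all lowercase
words = word_tokenize(data)  # split the text into word and punctuation tokens
wordsFiltered = []

for w in words:
    # Compare in lowercase so capitalized stop words such as "All" are removed too
    if w.lower() not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)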

If you want to remove additional words, you can add them to the stopWords set.
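As a rough sketch of that last point, the snippet below extends a stop word set with extra words and then recomputes relative frequencies over the remaining tokens. To keep it runnable without NLTK data, it uses a small hand-written stop word set as a stand-in (an assumption; in practice you would start from set(stopwords.words('english'))). The extra words "potential" and "successful", the sample sentence, and the regex tokenizer are illustrative choices, not part of the original answer.

```python
import re
from collections import Counter

# Stand-in stop word list so the sketch runs without NLTK data (assumption);
# in practice, start from set(stopwords.words('english')) instead.
stop_words = {"which", "this", "is", "a", "the", "of", "and", "in", "for"}

# Extra words to drop as well (hypothetical choices for illustration).
stop_words.update({"potential", "successful"})

# Made-up sample text for the demonstration.
text = ("Cholera is a deadly disease, which makes antimicrobial therapy "
        "a potential option for cholera.")

# Lowercase the text and split it into alphabetic tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Keep only the tokens that are not in the stop word set.
kept = [t for t in tokens if t not in stop_words]

# Relative frequency of each remaining word, measured over the kept tokens.
counts = Counter(kept)
total = len(kept)
freqs = {word: count / total for word, count in counts.items()}

# Print the most frequent words first.
for word, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"Frequency of: {word} --> {freq:.4f}")
```

Note that the frequencies here are computed after filtering, so they are shares of the remaining tokens; divide by len(tokens) instead if you want them relative to the full text.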

Solution 1
2 2017-05-04 11:32:25
