Removing accented words with an NLTK stop-word algorithm

Question

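I am trying to develop a simple algorithm in Python to remove stop words from a text, but I am running into problems with accented words. I am using the following code: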

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unicodedata import normalize
import sys

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = set(stopwords.words('portuguese'))
file1 = open("C:\Users\Desktop\Test.txt")
print("File open")
line = file1.read()
words = line.split()
#convert the words to lower case
words = [word.lower() for word in words]
print("Running!")
for r in words:
    if r not in stop_words:
            appendFile = open('finalText.txt','a')
            appendFile.writelines(" "+r)
            appendFile.close()

print("Finished!")

When I run the code with the following test file:

E É Á A O Ó U Ú

I get this output:

 É Á Ó Ú

It does not seem to recognize the accented words, and using "setdefaultencoding" for utf-8 does not help. Does anyone know a solution I can use to fix this?

Answer 1

This is not an encoding or accent problem. These are simply words that are not in the list:

from nltk.corpus import stopwords
stop_words = set(stopwords.words('portuguese'))

print(" ".join([w for w in stop_words if len(w) == 1]))
# >>> e à o a
# -> does not contain á é ó ú

print("À".lower() in stop_words)
# >>> True

You can add words to the set as needed (stop_words.add("é")).
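As a minimal sketch of that suggestion (assuming Python 3, where strings are Unicode by default and no setdefaultencoding call is needed), the filter below extends a stop-word set with extra accented entries. The helper name remove_stop_words, the variable extra_stop_words, and the sample text are hypothetical, and base_stop_words is a small stand-in for set(stopwords.words('portuguese')) so that the example runs without downloading the NLTK corpus. The NFC normalization step is an added precaution, not something the answer requires: it makes a decomposed "E" plus combining accent compare equal to the precomposed "É".

```python
import unicodedata


def remove_stop_words(text, stop_words):
    """Return the lowercased words of `text` that are not in `stop_words`."""
    # Normalize to NFC so decomposed accents (e.g. "E" + U+0301) match "É".
    normalized = unicodedata.normalize("NFC", text)
    return [w for w in normalized.lower().split() if w not in stop_words]


# Stand-in (assumption) for set(stopwords.words('portuguese')); it holds only
# the single-letter entries reported in the answer above.
base_stop_words = {"e", "à", "o", "a"}

# Hypothetical extra entries, added as the answer suggests.
extra_stop_words = {"é", "á", "ó", "ú"}
stop_words = base_stop_words | extra_stop_words

sample = "E É A Á O Ó"
print(remove_stop_words(sample, base_stop_words))  # ['é', 'á', 'ó']
print(remove_stop_words(sample, stop_words))       # []
```

With the real NLTK list, the same call would take set(stopwords.words('portuguese')) | extra_stop_words as its second argument. When reading the input file, passing encoding="utf-8" to open() is the Python 3 way to make the decoding explicit.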

Solution 1
0 Accepted 2018-06-28 08:34:08