Removing accented words from stop words algorithm with NLTK
I'm trying to develop a simple algorithm in Python to remove stop words from a text, but I'm having problems with words that have accents. I'm using the following code:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unicodedata import normalize
import sys
reload(sys)
sys.setdefaultencoding('utf8')
stop_words = set(stopwords.words('portuguese'))
file1 = open("C:\Users\Desktop\Test.txt")
print("File open")
line = file1.read()
words = line.split()
#convert the words to lower case
words = [word.lower() for word in words]
print("Running!")
for r in words:
    if r not in stop_words:
        appendFile = open('finalText.txt','a')
        appendFile.writelines(" "+r)
        appendFile.close()
print("Finished!")
When running the code with the following test file:
E É Á A O Ó U Ú
I get this output:
É Á Ó Ú
It doesn't seem to recognize accented words, and using "setdefaultencoding" for utf-8 does not work. Does anyone know of a solution I can use to solve this problem?
It is not an encoding or accent problem. These are simply words that are not in the list:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('portuguese'))
print(" ".join([w for w in stop_words if len(w) == 1]))
# >>> e à o a
# -> does not contain á é ó ú
print("À".lower() in stop_words)
# >>> True
You can just add words to the set ( stop_words.add("é") ) if you need to.
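As a rough sketch of that suggestion, the snippet below extends the set and then filters a sentence. To keep it runnable without NLTK, stop_words here is a small stand-in containing only the one-letter entries printed above, not the full Portuguese list. The helper name remove_stop_words and the sample sentence are made up for illustration, and the snippet assumes Python 3.

```python
# Stand-in for set(stopwords.words('portuguese')): an assumption that keeps
# only the one-letter entries shown in the output above.
stop_words = {"e", "à", "o", "a"}

# Add the accented forms that the list lacks.
stop_words.update({"é", "á", "ó", "ú"})


def remove_stop_words(text, stop_words):
    """Return the lower-cased tokens of text that are not stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


print(remove_stop_words("O café é bom", stop_words))
# ['café', 'bom']
```

With the real NLTK set, the same update() call should work unchanged, since stop_words is an ordinary Python set.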