Python: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
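I am trying to count the words in a body of text using NLTK. I read in a text file, convert it to lowercase, remove the punctuation, and tokenize it. Then I remove the stop words and count the most common words. However, I get the following warning: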
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Here is my code:
import nltk
import string
from nltk.corpus import stopwords
from collections import Counter

def get_tokens():
    with open('/Users/user/Code/abstract/data/Training(3500)/3500_Response_Tweets.txt', 'r') as r_tweets:
        text = r_tweets.read()
        lowers = text.lower()
        #remove the punctuation using the character deletion step of translate
        no_punctuation = lowers.translate(None, string.punctuation)
        tokens = nltk.word_tokenize(no_punctuation)
        return tokens

tokens = get_tokens()
filtered = [w for w in tokens if not w in stopwords.words('english')]
count = Counter(filtered)
print count.most_common(100)
Along with the warning, my output looks like:
[('so', 268), ('\xe2\x80\x8e\xe2\x80\x8fi', 231), ('like', 192), ('know', 157), ('dont', 137), ('get', 125), ('im', 122), ('would', 118), ('\xe2\x80\x8e\xe2\x80\x8fbut', 118), ('\xe2\x80\x8e\xe2\x80\x8foh', 114), ('right', 113), ('good', 105), ('\xe2\x80\x8e\xe2\x80\x8fyeah', 95), ('sure', 94), ('one', 92),
When I use codecs.open instead, I get this traceback:
Traceback (most recent call last):
  File "tfidf.py", line 16, in <module>
    tokens = get_tokens()
  File "tfidf.py", line 12, in get_tokens
    no_punctuation = lowers.translate(None, string.punctuation)
TypeError: translate() takes exactly one argument (2 given)
My suggestion: use io.open('filename.txt', 'r', encoding='utf8'). You then get nice unicode objects instead of ugly byte objects.
This works on both Python 2 and Python 3. See: https://stackoverflow.com/a/22288895/633961
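To make the suggestion concrete, here is a minimal sketch of the word count built on io.open. It rests on a few assumptions that are not in the question: the warning most likely comes from comparing byte-string tokens against unicode stop words in Python 2, and the TypeError arises because the translate() method of a unicode string takes a single mapping rather than the (None, deletechars) pair that Python 2 byte strings accept. The names count_words, STOPWORDS, and DELETE_TABLE are hypothetical, and split() stands in for nltk.word_tokenize so the sketch runs without NLTK data.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import string
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); a set keeps lookups fast.
STOPWORDS = {"i", "but", "oh", "the", "a"}

# Unicode strings take ONE argument to translate(): a mapping from code point
# to replacement. Mapping a code point to None deletes that character.
DELETE_TABLE = {ord(ch): None for ch in string.punctuation}
# U+200E and U+200F are the invisible direction marks that show up as
# '\xe2\x80\x8e\xe2\x80\x8f' in the byte-string output of the question.
DELETE_TABLE.update({0x200E: None, 0x200F: None})


def count_words(path):
    """Count non-stop-word tokens in a UTF-8 text file (illustrative only)."""
    # io.open decodes while reading, so read() returns text, not bytes.
    with io.open(path, "r", encoding="utf8") as f:
        text = f.read()
    cleaned = text.lower().translate(DELETE_TABLE)
    tokens = cleaned.split()  # assumption: replace with nltk.word_tokenize
    return Counter(t for t in tokens if t not in STOPWORDS)
```

Calling count_words('filename.txt').most_common(100) would then give the ranking the question asks for, with every token a unicode string, so comparisons against the stop-word list no longer mix bytes and text.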