Python: UnicodeEncodeError: 'ascii' codec can't encode characters in position 34-39: ordinal not in range(128)

I have a Twitter log file, and I need to sort it to show a ranking of each user's most-retweeted tweets.

Here's the code.

import codecs

with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()

tweet_list.pop(0)

facul={}
for t in tweet_list:
    t = t.split('\t')
    t[-2] = int(t[-2])   
    if t[-2] <= 0:      
        continue        
    if not t[0] in facul:
        facul[t[0]] = []
    facul[t[0]].append(t)

def cmp_retweet(a,b):
    if a[-2] < b[-2]:
        return 1
    if a[-2] > b[-2]:
        return -1
    return 0

for f in sorted(facul.keys()):
    facul[f].sort(cmp=cmp_retweet)
    print ('[%s]' %(f))
    for t in facul[f][:5]:
        print ('%d:%s:%s' % (t[-2], t[2], t[-1].strip())

Somehow I got an error saying:

print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())
UnicodeEncodeError: 'ascii' codec can't encode characters in position
34-39: ordinal not in range(128)

It looks like the Japanese characters can't be encoded. How can I fix this? I tried to use sys.setdefaultencoding("utf-8"), but then I got another error:

sys.setdefaultencoding("utf-8")
AttributeError: 'module' object has no attribute 'setdefaultencoding'

This is how I tried it:

import codecs
import sys
sys.setdefaultencoding("utf-8")

with codecs.open('hoge_qdata.tsv', 'r', 'utf-8') as tweets:
    tweet_list = tweets.readlines()

P.S. I am using Python 2.7.5.

The basic issue, as you have discovered, is that ASCII cannot represent most of Unicode.

So you have to choose how to handle it:

  • don't display non-ASCII characters
  • display non-ASCII characters using some other representation

The first choice would look like this:

for t in facul[f][:5]:
    print ('%d:%s:%s' % (
            t[-2],
            t[2].encode('ascii', errors='ignore'),
            t[-1].encode('ascii', errors='ignore').strip()
            ))

The second choice would replace ignore with something like replace, xmlcharrefreplace, or backslashreplace.

Here's the reference.
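As a rough illustration of how these error handlers differ, the sketch below encodes one sample string with each of them. The sample text is made up for the example and is not taken from the log file; the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# Hypothetical sample text: "\u65e5\u672c" is the two-character word for "Japan".
sample = u'tweet: \u65e5\u672c'

# Each handler decides what happens to characters ASCII cannot represent.
handlers = ['ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace']
encoded = {name: sample.encode('ascii', errors=name) for name in handlers}

for name in handlers:
    # Decoding as ASCII is safe here: every result is pure ASCII bytes.
    print('%-18s %s' % (name, encoded[name].decode('ascii')))
```

With ignore the two characters simply disappear, while the other three handlers leave a visible placeholder, so which one fits depends on whether the output needs to be readable or recoverable.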

The error message is giving you two clues. First, the problem is in the statement

print '%d:%s:%s' %(t[-2], t[2], t[-1].strip())

Second, the problem is related to an encode operation. If you don't remember what is meant by "encode", now would be a good time to re-read the Unicode HOWTO in the Python 2.7 docs.

It looks like your list t[] contains Unicode strings. The print statement emits byte strings. Converting Unicode strings to byte strings is called encoding. Because you aren't specifying an encoding, Python implicitly applies a default one. It uses the ascii codec, which cannot handle accented or non-Latin characters.
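To see that implicit step in isolation, here is a small sketch that performs the same encoding explicitly. The sample string is invented for the example; the point is only that the ascii codec rejects it while utf-8 accepts it.

```python
from __future__ import print_function

# Hypothetical sample: three Japanese characters (the word for "Japanese").
text = u'\u65e5\u672c\u8a9e'

try:
    text.encode('ascii')  # roughly what the implicit default encoding amounts to
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # exc.start and exc.end delimit the range of characters that could not be encoded
    print('ascii failed at positions %d-%d' % (exc.start, exc.end - 1))

# An explicit encoding that can represent the text succeeds.
utf8_bytes = text.encode('utf-8')
print('utf-8 produced %d bytes' % len(utf8_bytes))
```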

Try splitting that print statement into two parts. First, insert the Unicode t[] values into a Unicode format string; note the use of the u'' syntax. Second, encode the resulting Unicode string to UTF-8 and print it.

s = u'%d:%s:%s' %(t[-2], t[2], t[-1].strip())
print s.encode('utf8')

(I haven't tested this change to your code. Let me know if it doesn't work.)
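The same two-step idea can be sketched as a small helper that works on both Python 2.7 and Python 3. The function name format_tweet and the sample row are hypothetical; the row only mimics the shape the question's code relies on (text in t[2], an integer retweet count in t[-2], a trailing field in t[-1]), since the real column layout of the file isn't shown.

```python
from __future__ import print_function
import io

def format_tweet(row):
    """Build the output line as Unicode text first, then encode it to UTF-8 bytes."""
    line = u'%d:%s:%s' % (row[-2], row[2], row[-1].strip())
    return line.encode('utf-8')

# Hypothetical row, shaped like the result of split('\t') after int() conversion.
row = [u'user_a', u'2013-01-01', u'\u3053\u3093\u306b\u3061\u306f', 12,
       u'http://example.invalid/1\n']

encoded_line = format_tweet(row)

# A binary stream accepts bytes directly, so no implicit codec is involved.
out = io.BytesIO()
out.write(encoded_line + b'\n')
```

On Python 2.7 the bytes returned by the helper can be passed straight to print; on Python 3 they would go to a binary stream such as sys.stdout.buffer instead.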

I think sys.setdefaultencoding() is probably a red herring, but I don't know your environment well.
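One way to learn more about the environment is to ask Python which encodings it is working with. This is only a diagnostic sketch and changes nothing; the variable names are illustrative.

```python
from __future__ import print_function
import sys

# Encoding Python falls back on for implicit conversions.
default_encoding = sys.getdefaultencoding()

# Encoding attached to standard output; it may be None, for example
# when output is piped or redirected under Python 2.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

print('default encoding:', default_encoding)
print('stdout encoding: ', stdout_encoding)
```

If the stdout encoding turns out to be ASCII or None, setting the PYTHONIOENCODING environment variable to utf-8 before running the script is another commonly used workaround, though whether it suits this setup is an assumption.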

By the way, the statement as you wrote it above has unbalanced parentheses. Did you drop a right parenthesis when you pasted in the code?

print ('%d:%s:%s' %(t[-2], t[2], t[-1].strip())
