Why does Python 2.x throw an exception with string formatting + unicode?
I have the following piece of code. The last line throws an error. Why is that?
class Foo(object):
    def __unicode__(self):
        return u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

    def __str__(self):
        return self.__unicode__().encode('utf-8')

print "this works %s" % (u'asdf')
print "this works %s" % (Foo(),)
print "this works %s %s" % (Foo(), 'asdf')
print
print "this also works {0} {1}".format(Foo(), u'asdf')
print
print "this should break %s %s" % (Foo(), u'asdf')
The error is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 18: ordinal not in range(128)"
When you mix unicode and byte string objects, Python 2 implicitly converts between them: it will either try to encode unicode values to byte strings or try to decode byte strings to unicode, using the default ASCII codec.
You are mixing unicode, byte strings and a custom object, and that triggers a sequence of encodings and decodings that do not fit together.
In this case, your Foo() value is interpolated as a byte string (str(Foo()) is used), and the u'asdf' interpolation then triggers a decode of the template so far (which already contains the UTF-8 Foo() value) so that the unicode string can be interpolated. This decode fails because the ASCII codec cannot decode the \xe6\x9e\x97 UTF-8 byte sequence that has already been interpolated.
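To make the failing step visible, here is a minimal sketch that performs by hand the decode Python 2 does implicitly. The names text, partial, position and failing_byte are illustrative and not part of the original code; the snippet is written so that it also runs on Python 3, where the same decode has to be spelled out explicitly.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduce the implicit ASCII decode as an explicit call.
text = u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

# What the byte-string template looks like after the first %s has been
# filled in with str(Foo()), i.e. the UTF-8 encoded value.
partial = b'this should break ' + text.encode('utf-8')

try:
    # Python 2 does this step behind the scenes once it meets u'asdf'.
    partial.decode('ascii')
except UnicodeDecodeError as exc:
    position = exc.start                                  # index of the offending byte
    failing_byte = partial[exc.start:exc.start + 1]       # the byte itself
    print("cannot decode %r at position %d" % (failing_byte, position))
```

The reported position is the length of the literal prefix "this should break ", which is why the traceback in the question points at the first byte of the interpolated Foo() value.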
You should always explicitly encode Unicode values to byte strings, or decode byte strings to Unicode, before mixing types, as the corner cases are complex.
Explicitly converting to unicode() works:
>>> print "this should break %s %s" % (unicode(Foo()), u'asdf')
this should break 林覺民謝冰心故居 asdf
as the output is turned into a unicode string:
>>> "this should break %s %s" % (unicode(Foo()), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'
while otherwise you'd end up with a byte string:
>>> "this should break %s %s" % (Foo(), 'asdf')
'this should break \xe6\x9e\x97\xe8\xa6\xba\xe6\xb0\x91\xe8\xac\x9d\xe5\x86\xb0\xe5\xbf\x83\xe6\x95\x85\xe5\xb1\x85 asdf'
(note that asdf is left a byte string too).
Alternatively, use a unicode template:
>>> u"this should break %s %s" % (Foo(), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'
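One way to apply the "convert explicitly before mixing" advice consistently is to normalise every byte string to text before it reaches a unicode template. The sketch below assumes the incoming bytes are UTF-8; to_text is a hypothetical helper, not something from the standard library or from the code above, and it runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: decode byte strings explicitly so that no implicit
# ASCII decode is ever needed during formatting.

def to_text(value, encoding='utf-8'):
    """Return byte strings decoded to unicode; pass other values through."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value

# A byte string as it might arrive from a file or socket (assumed UTF-8).
name_bytes = u'\u6797\u89ba\u6c11'.encode('utf-8')

# Format with a unicode template and already-decoded arguments.
message = u"this works %s %s" % (to_text(name_bytes), to_text(u'asdf'))

# Encode once, at the output boundary.
encoded = message.encode('utf-8')
```

Values that are not byte strings, such as a Foo() instance, are passed through unchanged and left for the unicode template to convert.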