Why does Python 2.x throw an exception with string formatting + unicode?

Question

I have the following piece of code. The last line throws an error. Why is that?

class Foo(object):

    def __unicode__(self):
        return u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

    def __str__(self):
        return self.__unicode__().encode('utf-8')

print "this works %s" % (u'asdf')
print "this works %s" % (Foo(),)
print "this works %s %s" % (Foo(), 'asdf')
print

print "this also works {0} {1}".format(Foo(), u'asdf')
print
print "this should break %s %s" % (Foo(), u'asdf')

The error is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 18: ordinal not in range(128)"

Answer 1

When you mix unicode and byte string objects, Python 2 implicitly tries either to encode the unicode values to byte strings or to decode the byte strings to unicode, using the ASCII codec by default.

You are mixing unicode, byte strings and a custom object, and you are triggering a sequence of encodings and decodings that doesn't mix.

In this case, your Foo() value is interpolated as a byte string first (str(Foo()) is used). The u'asdf' interpolation then triggers a decode of the template built so far, which already contains the UTF-8 Foo() value, so that the unicode string can be interpolated. This decode fails because the ASCII codec cannot decode the \xe6\x9e\x97 UTF-8 byte sequence that has already been interpolated.
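The failing step can be reproduced by hand. The following is a minimal sketch, not the interpreter's actual code path: it builds the partially interpolated byte string and decodes it with the ASCII codec explicitly, which is roughly what Python 2 does implicitly. The variable names are illustrative only, and the snippet is written so that it should run on both Python 2.6+ and Python 3.

```python
# Sketch: mimic by hand the implicit decode that Python 2 performs.
# The variable names here are illustrative, not part of any API.

# The template text before the first %s, as a byte string.
template_so_far = b"this should break "

# What str(Foo()) returns in the question: UTF-8 encoded bytes.
foo_bytes = u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'.encode('utf-8')

# State of the result once Foo() has been interpolated.
partial = template_so_far + foo_bytes

try:
    # Python 2 does this implicitly when it meets the unicode argument.
    partial.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# Offset of the first undecodable byte, and the reason given by the codec.
print(failure.start)
print(failure.reason)
```

The offset reported is the length of "this should break ", which matches "position 18" in the traceback: the first byte of the Foo() value is the first one the ASCII codec rejects.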

You should always explicitly encode Unicode values to byte strings, or decode byte strings to Unicode, before mixing types, as the corner cases are complex.

Explicitly converting to unicode() works:

>>> print "this should break %s %s" % (unicode(Foo()), u'asdf')
this should break 林覺民謝冰心故居 asdf

as the output is turned into a unicode string:

>>> "this should break %s %s" % (unicode(Foo()), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'

while otherwise you'd end up with a byte string:

>>> "this should break %s %s" % (Foo(), 'asdf')
'this should break \xe6\x9e\x97\xe8\xa6\xba\xe6\xb0\x91\xe8\xac\x9d\xe5\x86\xb0\xe5\xbf\x83\xe6\x95\x85\xe5\xb1\x85 asdf'

(note that asdf is left a byte string too).

Alternatively, use a unicode template:

>>> u"this should break %s %s" % (Foo(), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'
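One way to apply the "convert explicitly before mixing" advice is a small helper that decodes byte strings up front, so that only text objects ever reach a unicode template. This is a sketch under assumptions: to_text is a hypothetical name, not a standard library function, and UTF-8 is assumed as the encoding of incoming byte strings. It does not handle custom objects such as Foo; for those, call unicode() on Python 2 as shown above.

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as text, decoding byte strings explicitly."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value

# A UTF-8 byte string, like the one returned by Foo.__str__ in the question.
name = u'\u6797\u89ba\u6c11'.encode('utf-8')

# Every argument is text before it reaches the unicode template,
# so no implicit ASCII decode is ever attempted.
message = u"%s %s" % (to_text(name), to_text(u'asdf'))
```

Because the decode happens with a named codec rather than the implicit ASCII one, a wrong encoding fails at the conversion point instead of deep inside the formatting operation.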
