
Python unicode encoding using UTF-8

I was working through Python's tutorial on unicode and I have a simple question. When I open a Python shell and type:

>>> unicode('\x80abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

I get the above error as expected, since Python attempts to convert the byte \x80 to unicode using the ascii encoding, which only goes as far as 127 (\x80 is 128).

However, if I try again using the utf-8 encoding, I again get an error, although a somewhat different one:

>>> unicode('\x80abc', 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

What is going on here, and how should I properly go about it?

It just so happens that \x80 is not a valid start byte in UTF-8 either.

Take a look at the character table for UTF-8 and you will see that the one-byte codes end at \x7f.
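A small sketch of that boundary, written with `b''` literals and `.decode()` in place of `unicode()` so that it should run on both Python 2.7 and Python 3. The variable names are illustrative only:

```python
# U+007F is the last code point that UTF-8 encodes as a single byte.
last_one_byte = u'\x7f'.encode('utf-8')

# U+0080 already needs two bytes in UTF-8: 0xC2 followed by 0x80.
first_two_byte = u'\x80'.encode('utf-8')

print(repr(last_one_byte))
print(repr(first_two_byte))

# A lone 0x80 is a continuation byte, so it cannot begin a UTF-8 sequence.
try:
    b'\x80abc'.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
```

Here 0x80 only ever appears as the second byte of a sequence, which is why the decoder reports an invalid start byte.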

If you want to test your example, try it with latin1 and the ñ character: unicode('\xf1abc','latin1'). Without the encoding argument it will fail, and with it it will pass.
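The same experiment as a short script, again using `.decode()` so that it is not tied to the Python 2 `unicode()` builtin. The explicit 'ascii' codec stands in for the Python 2 default here, and the names `raw`, `text` and `ascii_failed` are made up for the illustration:

```python
raw = b'\xf1abc'  # 0xF1 is the Latin-1 byte for the character ñ

# Decoding as Latin-1 succeeds and yields the code point U+00F1.
text = raw.decode('latin-1')

# Decoding as ASCII fails, because 0xF1 is outside the range 0-127.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(repr(text))
print(ascii_failed)
```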

First, '\x80abc' is a byte string (in Python < 3). If you want to convert a byte string to a unicode string, you have two options. Either you reinterpret every byte as a single unicode character (you can simply prepend a u to the string literal: u'\x80abc'), or you assume that the byte string is a unicode string encoded with a particular codec (such as ASCII, Latin1 or UTF-8), in which case you proceed as you attempted: by decoding it.

Calling unicode() is an explicit decoding. And as Paulo pointed out, a \x80 on its own is not valid in UTF-8, just as it is not valid in ASCII. You might try Latin1, though; this will work because Latin1 allows a \x80 byte in its stream.
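For this particular input the two options give the same result, since Latin1 maps each byte value 0-255 to the code point with the same number. A minimal sketch, assuming Python 2.7 or Python 3 and using illustrative variable names:

```python
raw = b'\x80abc'

# Option 2 from above: treat the bytes as Latin-1 encoded text and decode.
via_latin1 = raw.decode('latin-1')

# Option 1 from above: write the same characters as a unicode literal.
via_literal = u'\x80abc'

same_text = (via_latin1 == via_literal)

# Each byte value becomes the code point with the same number.
code_points = [ord(ch) for ch in via_latin1]
byte_values = list(bytearray(raw))

print(same_text)
print(code_points)
print(byte_values)
```

Whether Latin1 is the right choice still depends on how the bytes were produced; decoding with it never raises an error, but it only gives meaningful text if the data really was Latin1.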
