
Python encoding over unicode strings

So in a Python terminal I type the following:

>>> s = "γειά"       ## it just means 'hi' in Greek
>>> s
'\x9a\x9c\xa0\xe1'   ## What is this? - Is it utf-encoding? Is it ascii escaped?
>>> print s
γειά

And now the fun part:

>>> a = u"γειά"
>>> a
u'\u03b3\u03b5\u03b9\u03ac'    # Again what is this? utf-8 encoded? If so, how?
>>> print a
γειά

I am totally confused about encodings, particularly UTF-8 encoded strings and/or ASCII encoded strings. What is the difference between the two snippets above, and how do they tie in with the unicode function?

>>> result = unicode(s)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 0: ordinal not in range(128)

>>> result = unicode(s, 'utf-8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

Could someone explain to me what's happening here? Thanks in advance.

In your first attempt you're seeing the encoded bytes of the string, and they are not UTF-8 at all:

>>> s='\x9a\x9c\xa0\xe1'
>>> s.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

It is encoded in whatever encoding your shell is using.
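The byte values in the question happen to match Windows code page 737 (a Greek console code page). That is an assumption drawn from the bytes themselves, since the question never states which code page the console used. A minimal sketch, written in Python 3 syntax (where byte strings carry a `b` prefix), showing the same bytes failing as UTF-8 and decoding cleanly under the assumed encoding:

```python
# Bytes copied from the question's first example.
raw = b'\x9a\x9c\xa0\xe1'

# Decoding as UTF-8 fails: 0x9a is a continuation byte and cannot start a sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the console was using code page 737 (Greek).
text = raw.decode('cp737')

print(utf8_ok)
print(text)
```

If your console uses a different code page, the same characters will produce different bytes, so substitute the encoding your own shell reports.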

In your second example, you're creating a unicode string. Python, armed with your shell's encoding, is able to decode it from the input and store it as unicode code points (\u03b3\u03b5\u03b9\u03ac). Later, when you print it, Python also knows your shell's encoding and is able to encode it from unicode to actual bytes.
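To make the distinction between code points and bytes concrete, here is a small sketch in Python 3 syntax. The `cp737` console encoding is an assumption inferred from the bytes in the question, not something the question states:

```python
# The four code points shown in the question's second example.
text = u'\u03b3\u03b5\u03b9\u03ac'

# Code points are abstract numbers; they are not bytes yet.
code_points = [hex(ord(ch)) for ch in text]

# Encoding turns the same code points into different byte sequences.
utf8_bytes = text.encode('utf-8')
cp737_bytes = text.encode('cp737')  # assumed console encoding

print(code_points)
print(utf8_bytes)
print(cp737_bytes)
```

The same four characters take eight bytes in UTF-8 and four bytes in the assumed console code page, which is why the byte string in the first example is not valid UTF-8.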

In your third example, you're calling the unicode function explicitly. When it is called without an encoding argument, it uses ascii as the default. Since ASCII cannot represent Greek characters (byte 0x9a is outside its 0-127 range), Python complains.
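Python 3 has no `unicode` function, but `str(data, encoding)` plays a similar role, so the effect of the encoding argument can be sketched as follows. The helper name `try_decode` is hypothetical, and `cp737` is an assumed console encoding inferred from the bytes in the question:

```python
raw = b'\x9a\x9c\xa0\xe1'  # bytes from the question


def try_decode(data, encoding):
    """Return the decoded text, or a short failure note if decoding fails."""
    try:
        return str(data, encoding)
    except UnicodeDecodeError as exc:
        return 'failed: ' + exc.reason


# ASCII only covers byte values 0-127, so 0x9a is rejected.
ascii_result = try_decode(raw, 'ascii')

# Naming the encoding the bytes were actually written in succeeds.
cp737_result = try_decode(raw, 'cp737')

print(ascii_result)
print(cp737_result)
```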

Bottom line, you need to know what encoding your console is using to figure out exactly what Python is doing with your code. If you are on Windows you can check this with the chcp command. On Linux you can use the locale command.
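As a complement to `chcp` and `locale`, Python itself can report what it believes the encodings are. A minimal sketch using the standard `sys` and `locale` modules; the values printed depend entirely on your environment:

```python
import locale
import sys

# Encoding Python uses when writing text to standard output.
# It can be missing if stdout has been replaced by an object without this attribute.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Encoding suggested by the current locale settings.
preferred_encoding = locale.getpreferredencoding(False)

print('stdout encoding:', stdout_encoding)
print('locale preferred encoding:', preferred_encoding)
```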

Of course I forgot the most important advice ever :P. As @thg435 suggested, this is a must-read: Unicode by Joel

It is also worth mentioning that a lot of this changes dramatically in Python 3.
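One such change, sketched here as an illustration rather than a full summary: in Python 3 a plain string literal is already a sequence of code points, bytes are a separate type, and the two are never mixed through an implicit ascii decode.

```python
# In Python 3, str holds code points; no u prefix is needed.
text = '\u03b3\u03b5\u03b9\u03ac'

# Bytes only appear through an explicit encode.
data = text.encode('utf-8')

# Combining str and bytes raises TypeError instead of silently decoding as ascii.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))
print(mixed_ok)
```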
