Encoding Unicode strings in Python

Question

So in the Python terminal I typed the following:

>>> s = "γειά"       ## it just means 'hi' in Greek
>>> s
'\x9a\x9c\xa0\xe1'   ## What is this? - Is it utf-encoding? Is it ascii escaped?
>>> print s
γειά

Now the interesting part:

>>> a = u"γειά"
>>> a
u'\u03b3\u03b5\u03b9\u03ac'    # Again what is this? utf-8 encoded? If so, how?
>>> print a
γειά

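I am confused about encodings, especially UTF-8-encoded strings versus ASCII-encoded strings. What is the difference between the two snippets above, and how do they relate to the unicode function?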

>>> result = unicode(s)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 0: ordinal not in range(128)

>>> result = unicode(s, 'utf-8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

Can someone explain to me what is happening here? Thanks in advance.

Answer 1

In your first attempt you are looking at an encoded version of the string, and the encoding is not UTF-8:

>>> s='\x9a\x9c\xa0\xe1'
>>> s.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

It is encoded in whatever encoding your shell uses.
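To make that concrete, here is a minimal sketch that tries the same four bytes against two codecs. It is written for Python 3, where Python 2's byte string corresponds to `bytes`. The helper name `try_decode` is made up for this illustration, and the choice of code page 737 (Greek DOS) is an assumption about the asker's console, based only on the fact that these bytes happen to spell the word in that code page.

```python
# The bytes that the question's session displayed for s.
raw = b'\x9a\x9c\xa0\xe1'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper, only for this illustration.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0x9a is a UTF-8 continuation byte, so it cannot start a character.
as_utf8 = try_decode(raw, 'utf-8')

# Assumption: the console used code page 737 (Greek DOS).
as_cp737 = try_decode(raw, 'cp737')

print(as_utf8 is None)   # the UTF-8 attempt fails
print(as_cp737 == u'\u03b3\u03b5\u03b9\u03ac')
```

If the second comparison is true on your machine, the bytes were most likely produced by a console set to that code page; a different console setting would produce different bytes for the same word.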

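In the second example you are creating a unicode string. Python knows your shell's encoding, so it can decode the input and store it as Unicode code points (\u03b3\u03b5\u03b9\u03ac). Later, when you print it, Python again uses your shell's encoding to encode the unicode string back into actual bytes.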
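The difference between code points and encoded bytes can be sketched as follows. This is only an illustration: the variable names are invented here, and `cp737` is again an assumed console encoding rather than something the question states.

```python
# The same value as a in the question, written with escapes.
text = u'\u03b3\u03b5\u03b9\u03ac'

# A unicode string is a sequence of code points, independent of any encoding.
code_points = [hex(ord(ch)) for ch in text]

# Encoding turns those code points into bytes; the bytes depend on the codec.
utf8_bytes = text.encode('utf-8')    # two bytes per Greek letter
cp737_bytes = text.encode('cp737')   # one byte per Greek letter (assumed console codec)

print(code_points)
print(len(utf8_bytes), len(cp737_bytes))
```

The same four code points become eight bytes in UTF-8 and four bytes in code page 737, which is why the byte string in the first snippet looks nothing like UTF-8.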

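As for your third example, you are calling the unicode function explicitly. When it is called without an encoding argument, it uses ascii as the default. Since ascii cannot represent Greek characters, Python complains.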
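A rough Python 3 equivalent of that failure, and of the fix, might look like this. In Python 3 there is no `unicode` function, so the sketch uses `bytes.decode` with an explicit `'ascii'` codec to stand in for the Python 2 default; the `cp737` codec in the last step is an assumption about the console, not something stated in the question.

```python
raw = b'\x9a\x9c\xa0\xe1'

# Stand-in for Python 2's unicode(s): decode with the ascii codec.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    # ASCII only covers bytes 0-127, and 0x9a is 154.
    ascii_error = exc

# Naming the codec that actually produced the bytes makes the decode succeed.
# Assumption: the bytes came from a code page 737 console.
text = raw.decode('cp737')

print(ascii_error is not None)
print(len(text))
```

The point is that decoding needs the codec the bytes were written in; passing `'utf-8'` fails for the same reason `'ascii'` does, because neither matches the console's encoding.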

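Most importantly, you need to know which encoding your console is using in order to understand what Python is doing with your input. On Windows you can find out with the chcp command. On Linux you can use the locale command.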
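The same information can usually be read from inside Python as well. The following is a small sketch using the standard `sys` and `locale` modules; the values it prints depend entirely on your environment, and `sys.stdout.encoding` may be `None` when output is redirected.

```python
import locale
import sys

# Encoding Python uses when writing text to the console (may be None if redirected).
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Encoding the operating system's locale prefers for text data.
preferred = locale.getpreferredencoding(False)

print(stdout_encoding, preferred)
```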

Of course, I forgot the most important piece of advice ever :P. As @thg435 said, this is required reading: Joel on Unicode.

It is also worth mentioning that much of this changed significantly in Python 3.
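As a brief illustration of one such change: in Python 3 a plain string literal is already Unicode text, and encoded data lives in a separate `bytes` type that does not mix with it implicitly. The sketch below assumes Python 3; the variable names are made up for the example.

```python
# In Python 3 a plain literal is Unicode text; no u prefix is needed.
s = '\u03b3\u03b5\u03b9\u03ac'   # "γειά"

# Encoding is always explicit and yields a separate bytes object.
encoded = s.encode('utf-8')

# Text and bytes are never combined implicitly; mixing them raises TypeError
# instead of silently attempting an ascii decode.
try:
    s + encoded
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(type(s).__name__, type(encoded).__name__, mixing_allowed)
```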

Solution 1
2 2014-02-26 11:57:23