Python: decoding a Unicode hex string

Question

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9', encoded in Windows-1255, and I want to decode it into Unicode code points (u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9').

>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

However, if I try to decode the plain string '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9', I don't get an exception:

>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

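How can I decode the Unicode hex string (the one that raises the exception), or convert it into a regular string that can be decoded?

Thanks for your help.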

Answer 1

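That's because \xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9 is a byte array, not a Unicode string: the bytes represent valid windows-1255 characters, not the Unicode code points you intend.

So once you prefix it with u, the Python interpreter cannot decode the string, or even print it (here on a console whose encoding is ASCII):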

>>> print u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
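
Both tracebacks report an encode error even though the question calls .decode(). A likely explanation is that Python 2 first has to turn the Unicode string into bytes, and it does so implicitly with the ASCII codec. The sketch below makes that hidden step explicit; the variable names are illustrative, and the snippet is written to run on both Python 2.7 and Python 3:

```python
# The mis-typed string from the question: code points U+00E4, U+00E7, ...
s = u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'

try:
    # Python 2 performs this step implicitly before decoding or printing
    # (assuming the default/console encoding is ASCII).
    s.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

if error is not None:
    # The first run of non-ASCII characters spans indexes 0..3.
    print('%s failed at positions %d-%d' % (error.encoding, error.start, error.end - 1))
```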

So, to convert the byte array to UTF-8, you have to decode it as windows-1255 and then encode it as utf-8:

>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255').encode('utf8')
'\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'

which gives the original Hebrew text:

>>> print '\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'
החלק השלישי

Answer 2

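"I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255"

That is self-contradictory. The u indicates that it is a Unicode string. But if you say it is encoded in some way, it must be a byte string (because a Unicode string can only be encoded into a byte string).

And indeed, the entities you give (\xe4\xe7 and so on) each represent one byte, and only the given encoding, windows-1255, gives each of them its meaning.

In other words, if you have a u'\xe4', you can be sure it is the same as u'ä', and not u'ה' as you would otherwise expect.

If, by any chance, you got this wrong Unicode string from a source that is unaware of the problem, you can recover the byte string you really need from it with the help of the "1:1 encoding" latin1.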
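
To make the difference concrete, here is a small sketch that asks the standard unicodedata module for the name of each character. The variable names are illustrative only; the snippet should run on Python 2.7 and Python 3:

```python
import unicodedata

# The Unicode string u'\xe4' is the single code point U+00E4.
as_code_point = u'\xe4'

# The byte 0xE4, interpreted through windows-1255, is a different character.
as_cp1255_byte = b'\xe4'.decode('windows-1255')

name_code_point = unicodedata.name(as_code_point)
name_cp1255 = unicodedata.name(as_cp1255_byte)

print(name_code_point)
print(name_cp1255)
```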

So:

correct_str = u_str.encode("latin1")
# now every byte of the correct_str corresponds to the respective code point in the 0x80..0xFF range
correct_u_str = correct_str.decode("windows-1255")
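
The two steps can be wrapped in a small function. This is only a sketch: redecode is a hypothetical helper name, not part of any library, and it assumes every character of the input lies in the U+0000..U+00FF range, since latin-1 cannot encode anything above that. It should run on Python 2.7 and Python 3:

```python
def redecode(mis_decoded, encoding='windows-1255'):
    """Hypothetical helper: recover text whose bytes were stored as code points."""
    # latin-1 maps code points U+0000..U+00FF to the bytes 0x00..0xFF one to one.
    raw = mis_decoded.encode('latin-1')
    # Reinterpret those bytes using the encoding they were really written in.
    return raw.decode(encoding)


u_str = u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
fixed = redecode(u_str)

# Applying the helper to text that is already correct fails, because the
# Hebrew code points are above U+00FF and latin-1 cannot encode them.
try:
    redecode(fixed)
    second_pass_failed = False
except UnicodeEncodeError:
    second_pass_failed = True
```

Because the second pass raises instead of silently producing garbage, the helper is reasonably safe to call on input that might already be correct, as long as the caller handles the exception.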

Answer 3

Try this:

>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.encode('latin-1').decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

Answer 4

Decode it like this:

>>> b'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

4 answers

Answer 1: score 3, 2014-11-01 09:11:10
Answer 2: score 3, accepted, 2014-11-01 09:49:48
Answer 3: score 1, 2014-11-01 09:12:48
Answer 4: score -1, 2014-11-01 09:10:27