How to correctly represent supplementary Unicode characters in Python 3 (3.6.1+) using \u or \U escapes in a string

Question

I have recently been learning Python and ran into a problem with Unicode escape literals in Python 3.

It seems that, as in Java, the \u escape is interpreted as a UTF-16 code unit, but here is the problem:

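For example, suppose I try to put a 3-byte UTF-8 character such as “♬” (https://unicode-table.com/en/266C/), or even a supplementary Unicode character such as “𠜎” (https://unicode-table.com/en/2070E/), into an ordinary string in the form \uXXXX or \UXXXXXXXX, like this: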

print('\u00E2\u99AC')  # UTF-8 bytes: garbled output, as expected
print('\U00E299AC')    # UTF-8 bytes with the 8-digit \U form: (unicode error), as expected
print('\u266C')        # UTF-16 BE: the music note appears
# from which I suppose \u and \U work the same way as they do in Java
# (perhaps slightly differently, since in Java they act like macros and can be used in comments)

# However, while print('\u266C') gives me '♬', '\u266C' == '♬' evaluates to False,
# whereas it is true under Java semantics.
# Furthermore, print('\UD841DF0E') didn't give me '𠜎': (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
# which I expected it to, so it seems I may have gotten something wrong
# Here again: print('\uD841\uDF0E')  # Error, 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

print('\xD8\x41\xDF\x0E')  # also tried this: garbled output
# maybe UTF-16 LE?
print('\u41D8\u0EDF')  # garbled output
print('\U41D80EDF')  # error

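So it looks to me as if Python "does not support supplementary characters in escape literals", and its behavior is also rather strange.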

I do already know the correct way to decode and encode these characters:

s_decoded = '\\xe2\\x99\\xac'.encode().decode('unicode-escape')\
               .encode('latin-1').decode('utf-8')
print(b'\xf0\xa0\x9c\x8e'.decode('utf-8'))
print(b'\xd8\x41\xdf\x0e'.decode('utf-16 be'))
assert s_decoded == '♬'

But I still don't know how to use the \u and \U escape literals correctly. I hope someone can point out what I am doing wrong and how this differs from Java. Thanks!

By the way, my environment is PyCharm on Windows, Python 3.6.1, with the source encoding set to UTF-8.

Answer 1

Python 3.6.3:

>>> print('\u266c') # U+266C
♬
>>> print('\U0002070E') # U+2070E.  Python is not Java
𠜎
>>> '\u266c' == '♬'
True
>>> '\U0002070E' == '𠜎'
True
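To spell out what the session above relies on: in Python 3, \u takes exactly four hex digits and \U exactly eight, and both name a Unicode code point directly rather than UTF-8 bytes or UTF-16 code units. The sketch below checks this and also shows one way to turn the UTF-16 surrogate pair from the question (D841 DF0E) into the code point that \U expects. The helper name surrogates_to_code_point is hypothetical, added only for illustration.

```python
# \u (4 hex digits) and \U (8 hex digits) both name a Unicode code point.
note = '\u266c'       # U+266C
han = '\U0002070E'    # U+2070E, zero-padded to 8 digits

assert note == '♬' and ord(note) == 0x266C
assert han == '𠜎' and len(han) == 1 and ord(han) == 0x2070E

# chr() builds the same strings from the integer code point.
assert chr(0x266C) == note
assert chr(0x2070E) == han

# The byte sequences tried in the question are encodings of these
# code points, not values that belong inside \u or \U.
assert note.encode('utf-8') == b'\xe2\x99\xac'
assert han.encode('utf-8') == b'\xf0\xa0\x9c\x8e'
assert han.encode('utf-16-be') == b'\xd8\x41\xdf\x0e'


def surrogates_to_code_point(high, low):
    """Combine a UTF-16 surrogate pair into one code point (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = surrogates_to_code_point(0xD841, 0xDF0E)
literal = '\\U{:08X}'.format(code_point)  # the escape text to type in source
print(hex(code_point), literal)
```

So a Java-style pair such as \uD841\uDF0E would be written in Python as the single escape \U0002070E.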

Solution 1
1 2018-02-08 05:21:58
