Python returns a length of 2 for a string containing a single Unicode character

Question

In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍
In [4]: unicode_str = utf8_str.decode('utf-8')
In [5]: print(unicode_str)
👍 
In [6]: unicode_str
Out[6]: u'\U0001f44d'
In [7]: len(unicode_str)
Out[7]: 2

Since unicode_str contains only a single Unicode code point (0x0001f44d), why does len(unicode_str) return 2 instead of 1?

Answer 1

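Your Python binary was compiled with UCS-2 support (a narrow build), and internally anything outside the BMP (Basic Multilingual Plane) is represented using a surrogate pair.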

This means that such code points show up as 2 characters when you ask for the length.
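The surrogate pair arithmetic can be sketched as follows. This is an illustration of the standard UTF-16 rule, not code taken from CPython; the helper name to_surrogate_pair is made up for this example, and the print call assumes Python 3.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print(hex(high), hex(low))  # 0xd83d 0xdc4d
```

The two values are the two "characters" that a narrow build stores for u'\U0001f44d', which is why len() reports 2.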

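If this matters, you will have to recompile your Python binary to use UCS-4 instead (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer, where Python's Unicode support was overhauled to use a variable-width Unicode type that switches between ASCII, UCS-2 and UCS-4 as required by the code points contained.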

On Python 2.7 and 3.0 - 3.2, you can detect what kind of build you have by inspecting the sys.maxunicode value; it is 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build, and 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up it is always set to 1114111.
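A minimal sketch of that check, in case you want it in code. The function name unicode_build and the "narrow"/"wide" labels are assumptions made for this example; on Python 3.3 and up it will always report "wide", even though the internal representation there is flexible.

```python
import sys


def unicode_build():
    """Classify the interpreter by sys.maxunicode (hypothetical helper)."""
    if sys.maxunicode == 0xFFFF:
        return "narrow"   # UCS-2 build: non-BMP code points become surrogate pairs
    return "wide"         # UCS-4 build, or any Python 3.3+


print(sys.maxunicode, unicode_build())
```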

Demo:

# Narrow build
$ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
65535 2 [u'\ud83d', u'\udc4d']
# Wide build
$ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
1114111 1 [u'\U0001f44d']
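If recompiling or upgrading is not an option, one possible workaround is to count code points yourself by treating a high surrogate followed by a low surrogate as a single unit. The sketch below is an assumption-laden illustration, not part of the original answer; count_code_points is a hypothetical name, and unpaired surrogates are simply counted as one each.

```python
def count_code_points(text):
    """Count code points, treating each surrogate pair as one (hypothetical helper)."""
    count = 0
    i = 0
    while i < len(text):
        unit = ord(text[i])
        is_pair = (
            0xD800 <= unit <= 0xDBFF                    # high surrogate...
            and i + 1 < len(text)
            and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF    # ...followed by a low surrogate
        )
        i += 2 if is_pair else 1
        count += 1
    return count


# The explicit pair mimics what a narrow build stores for u"\U0001f44d".
print(len(u"\ud83d\udc4d"), count_code_points(u"\ud83d\udc4d"))  # 2 1
```

On a wide build or on Python 3.3 and up, the function gives the same result as len() for ordinary strings, so it is safe to call on either kind of build.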
