Python returns length of 2 for single Unicode character string

In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍
In [4]: unicode_str = utf8_str.decode('utf-8')
In [5]: print(unicode_str)
👍 
In [6]: unicode_str
Out[6]: u'\U0001f44d'
In [7]: len(unicode_str)
Out[7]: 2

Since unicode_str contains only a single Unicode code point (U+1F44D), why does len(unicode_str) return 2 instead of 1?

Your Python binary was compiled with UCS-2 support (a narrow build), and internally anything outside of the BMP (Basic Multilingual Plane) is represented using a surrogate pair.

That means each such code point is stored as two code units, and len() counts code units, so it reports 2.
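To make the surrogate pair concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above U+FFFF into two 16-bit code units. The helper name to_surrogate_pair is made up for illustration; it is not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print("%s %s" % (hex(high), hex(low)))  # 0xd83d 0xdc4d
```

These are the same two values a narrow build shows when you iterate over the string.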

If this matters, you'll have to recompile your Python binary to use UCS-4 instead (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer, where Python's Unicode support was overhauled to use a variable-width Unicode type that switches between 1-, 2- and 4-byte representations (Latin-1, UCS-2 and UCS-4) as required by the code points contained.
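If recompiling or upgrading is not an option, one possible workaround is to count code points without relying on len() of the string itself. The sketch below encodes to UTF-32, which always uses 4 bytes per code point. The name count_code_points is hypothetical, and I am assuming the UTF-32 encoder on a narrow build merges each surrogate pair into one code point; verify that on your own interpreter.

```python
def count_code_points(text):
    """Count code points via UTF-32, which is fixed at 4 bytes per code point.

    Hypothetical helper; assumes `text` holds no lone surrogates.
    """
    # 'utf-32-le' writes no byte order mark, so the byte length is exactly 4 * n.
    return len(text.encode("utf-32-le")) // 4


thumbs_up = b"\xf0\x9f\x91\x8d".decode("utf-8")
print(count_code_points(thumbs_up))
```

On Python 3.3 and newer this agrees with len(thumbs_up), so the helper is only useful on narrow builds.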

On Python versions 2.7 and 3.0 - 3.2, you can detect what kind of build you have by inspecting the sys.maxunicode value; it'll be 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build, and 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up it is always set to 1114111.
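As a small sketch of that check, the snippet below classifies the running interpreter from sys.maxunicode alone. The function name unicode_build and its return strings are illustrative, not a standard API.

```python
import sys


def unicode_build():
    """Classify the interpreter as 'narrow' or 'wide' from sys.maxunicode.

    Hypothetical helper. Python 3.3+ always reports 0x10FFFF, so it is
    classified as 'wide' here even though the narrow/wide distinction
    no longer exists there.
    """
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print("%s build, sys.maxunicode = %d" % (unicode_build(), sys.maxunicode))
```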

Demo:

# Narrow build
$ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
65535 2 [u'\ud83d', u'\udc4d']
# Wide build
$ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
1114111 1 [u'\U0001f44d']
