What does sys.maxunicode mean?
CPython stores Unicode strings internally as either UTF-16 or UTF-32, depending on compile options. In UTF-16 builds of Python, string slicing, iteration, and len seem to work on code units, not code points, so multibyte characters behave strangely.
E.g., on CPython 2.6 with sys.maxunicode = 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
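The two code units in the output above form a UTF-16 surrogate pair. For reference, the original code point can be recovered from the pair arithmetically; a minimal sketch (the helper name is my own, not a library function):

```python
def combine_surrogates(high, low):
    # Decode a UTF-16 surrogate pair (high in 0xD800-0xDBFF,
    # low in 0xDC00-0xDFFF) back into a single code point.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair from the session above reassembles to U+1D49E.
print(hex(combine_surrogates(0xD835, 0xDC9E)))  # prints 0x1d49e
```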
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I have to either use a UTF-32 build or write my own portable unicode operations?
I came across this problem in "How to iterate over Unicode characters in Python 3?"
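For what it's worth, a portable iteration over code points can be sketched by pairing up surrogates manually (a hypothetical helper, not a standard-library function; on wide builds and on Python 3, where no surrogate pairs occur, it simply yields each character unchanged):

```python
def iter_code_points(s):
    # Yield one substring per code point, joining a high surrogate
    # with the following low surrogate on narrow (UTF-16) builds.
    i = 0
    while i < len(s):
        c = s[i]
        if (0xD800 <= ord(c) <= 0xDBFF and i + 1 < len(s)
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            yield s[i:i + 2]  # high + low surrogate = one code point
            i += 2
        else:
            yield c
            i += 1

print(list(iter_code_points('a\U0001D49E')))
```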
Characters beyond sys.maxunicode=65535 are stored internally using UTF-16 surrogates. Yes, you have to deal with this yourself or use a wide build. Even with a wide build, you may also have to deal with single characters represented by a combination of code points. For example:
>>> print('a\u0301')
á
>>> print('\xe1')
á
The first uses a combining accent character and the second doesn't, but both print the same. You can use unicodedata.normalize to convert between the forms.
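A minimal sketch of converting between the two forms with unicodedata.normalize (NFC composes the sequence into the precomposed character, NFD decomposes it back):

```python
import unicodedata

combining = 'a\u0301'   # 'a' followed by COMBINING ACUTE ACCENT
precomposed = '\xe1'    # LATIN SMALL LETTER A WITH ACUTE

# NFC composes base + accent into the single precomposed character;
# NFD decomposes the precomposed character back into base + accent.
print(unicodedata.normalize('NFC', combining) == precomposed)   # prints True
print(unicodedata.normalize('NFD', precomposed) == combining)   # prints True
```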