
What does sys.maxunicode mean?

CPython stores unicode strings internally as either utf-16 or utf-32, depending on compile options. In utf-16 builds of Python, string slicing, iteration, and len seem to work on code units, not code points, so characters outside the BMP (which take two code units) behave strangely.

E.g., on CPython 2.6 with sys.maxunicode = 65535:

>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
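
For reference, the two halves returned by the slices are the UTF-16 surrogate pair for U+1D49E. A minimal sketch of the arithmetic follows; to_surrogate_pair is a hypothetical helper name used only for illustration, not a standard library function:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1D49E)
print(hex(high), hex(low))  # 0xd835 0xdc9e
```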

According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."

Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I either have to use a utf-32 build or write my own portable unicode operations?

I came across this problem in "How to iterate over Unicode characters in Python 3?"

Characters beyond sys.maxunicode=65535 are stored internally using UTF-16 surrogates. Yes, you have to deal with this yourself or use a wide build. Even with a wide build, you may also have to deal with single characters represented by a combination of code points. For example:

>>> print('a\u0301')
á
>>> print('\xe1')
á

The first uses a combining accent character and the second doesn't. Both print the same. You can use unicodedata.normalize to convert between the two forms.
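
Both points can be sketched in a few lines. This is only an illustration under stated assumptions: iter_code_points is a hypothetical helper (not part of the standard library) showing one way to "deal with this yourself" by joining surrogate pairs, and the explicit surrogate escapes stand in for what a narrow build would store for U+1D49E. unicodedata.normalize is the standard library call mentioned above.

```python
import unicodedata

def iter_code_points(text):
    """Yield integer code points, joining UTF-16 surrogate pairs when present."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            nxt = ord(text[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00)
                i += 2
                continue
        yield unit
        i += 1

# Explicit surrogate escapes mimic the two code units a narrow build stores.
narrow_style = u'\ud835\udc9e'
print([hex(cp) for cp in iter_code_points(narrow_style)])  # ['0x1d49e']

# The two spellings of "a with acute accent" differ until normalized.
decomposed = u'a\u0301'
composed = u'\xe1'
print(decomposed == composed)                                 # False
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
print(unicodedata.normalize('NFD', composed) == decomposed)   # True
```

The helper also works unchanged on strings that already hold whole code points, since anything that is not a surrogate pair is yielded as-is.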


 