What does sys.maxunicode mean?
CPython stores Unicode strings internally as either UTF-16 or UTF-32, depending on compile options. In UTF-16 builds of Python, string slicing, iteration, and len seem to work on code units, not code points, so multibyte characters behave strangely.
E.g., on CPython 2.6 with sys.maxunicode = 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
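The two code units in the output above form a UTF-16 surrogate pair. For reference, the original code point can be recovered from the pair arithmetically; a minimal sketch (the helper name is my own, not a library function):

```python
def combine_surrogates(high, low):
    # Decode a UTF-16 surrogate pair (high in 0xD800-0xDBFF,
    # low in 0xDC00-0xDFFF) back into a single code point.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair from the session above reassembles to U+1D49E.
print(hex(combine_surrogates(0xD835, 0xDC9E)))  # prints 0x1d49e
```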
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I have to either use a UTF-32 build or write my own portable unicode operations?
I came across this problem in "How to iterate over Unicode characters in Python 3?"
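For what it's worth, a portable iteration over code points can be sketched by pairing up surrogates manually (a hypothetical helper, not a standard-library function; on wide builds and on Python 3, where no surrogate pairs occur, it simply yields each character unchanged):

```python
def iter_code_points(s):
    # Yield one substring per code point, joining a high surrogate
    # with the following low surrogate on narrow (UTF-16) builds.
    i = 0
    while i < len(s):
        c = s[i]
        if (0xD800 <= ord(c) <= 0xDBFF and i + 1 < len(s)
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            yield s[i:i + 2]  # high + low surrogate = one code point
            i += 2
        else:
            yield c
            i += 1

print(list(iter_code_points('a\U0001D49E')))
```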
Characters beyond sys.maxunicode=65535 are stored internally using UTF-16 surrogates. Yes, you have to deal with this yourself or use a wide build. Even with a wide build, you may also have to deal with single characters represented by a combination of code points. For example:
>>> print('a\u0301')
á
>>> print('\xe1')
á
The first uses a combining accent character and the second doesn't, but both print the same. You can use unicodedata.normalize to convert between the forms.
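A minimal sketch of converting between the two forms with unicodedata.normalize (NFC composes the sequence into the precomposed character, NFD decomposes it back):

```python
import unicodedata

combining = 'a\u0301'   # 'a' followed by COMBINING ACUTE ACCENT
precomposed = '\xe1'    # LATIN SMALL LETTER A WITH ACUTE

# NFC composes base + accent into the single precomposed character;
# NFD decomposes the precomposed character back into base + accent.
print(unicodedata.normalize('NFC', combining) == precomposed)   # prints True
print(unicodedata.normalize('NFD', precomposed) == combining)   # prints True
```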