How to find out if Python is compiled with UCS-2 or UCS-4?

Question

Just what the title says.

$ ./configure --help | grep -i ucs
  --enable-unicode[=ucs[24]]

Searching the official documentation, I found this:

sys.maxunicode : An integer giving the largest supported code point for a Unicode character. The value of this depends on the configuration option that specifies whether Unicode characters are stored as UCS-2 or UCS-4.

What is not clear is which value corresponds to UCS-2 and which to UCS-4.

The code is expected to work on Python 2.6+.

Answer 1

When built with --enable-unicode=ucs4:

>>> import sys
>>> print sys.maxunicode
1114111

When built with --enable-unicode=ucs2:

>>> import sys
>>> print sys.maxunicode
65535

Answer 2

It's 0xFFFF (or 65535) for UCS-2, and 0x10FFFF (or 1114111) for UCS-4:

Py_UNICODE
PyUnicode_GetMax(void)
{
#ifdef Py_UNICODE_WIDE
    return 0x10FFFF;
#else
    /* This is actually an illegal character, so it should
       not be passed to unichr. */
    return 0xFFFF;
#endif
}

The maximum character in UCS-4 mode is defined by the maximum value representable in UTF-16.
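To see where 0x10FFFF comes from, here is a small sketch of the UTF-16 surrogate-pair arithmetic. The function and constant names are made up for illustration and are not part of CPython. A surrogate pair carries 20 bits (10 from each half) on top of an offset of 0x10000, so the largest pair decodes to 0x10FFFF:

```python
# Last high (lead) and last low (trail) surrogate in UTF-16.
HIGH_MAX = 0xDBFF
LOW_MAX = 0xDFFF

def decode_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(decode_surrogate_pair(HIGH_MAX, LOW_MAX)))  # 0x10ffff
```

The largest pair gives exactly the value that a wide build reports as sys.maxunicode.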

Answer 3

I had this same issue once. I documented it for myself on my wiki at

http://arcoleo.org/dsawiki/Wiki.jsp?page=Python%20UTF%20-%20UCS2%20or%20UCS4

I wrote:

import sys
'UCS4' if sys.maxunicode > 65535 else 'UCS2'

Answer 4

The sysconfig module reports the Unicode code unit size from Python's build configuration variables.

The build flag can be queried like this; the result is 2 for a UCS-2 build and 4 for a UCS-4 build.

Python 2.7:

import sysconfig
sysconfig.get_config_var('Py_UNICODE_SIZE')

Python 2.6:

import distutils.sysconfig  # a bare "import distutils" does not load the submodule
distutils.sysconfig.get_config_var('Py_UNICODE_SIZE')
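If the same script has to run on both 2.6 and 2.7, the two imports can be combined. This is only a sketch: the function name unicode_size is made up, and the fallback assumes that a missing Py_UNICODE_SIZE (get_config_var returns None for unknown variables) can be replaced by a check on sys.maxunicode.

```python
import sys

try:
    import sysconfig  # available from Python 2.7
except ImportError:
    from distutils import sysconfig  # Python 2.6

def unicode_size():
    """Return the size in bytes of a Unicode code unit (2 or 4)."""
    size = sysconfig.get_config_var('Py_UNICODE_SIZE')
    if size is None:
        # Assumption: the build does not report the variable,
        # so infer the size from sys.maxunicode instead.
        size = 4 if sys.maxunicode > 0xFFFF else 2
    return size

print(unicode_size())
```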

Answer 5

I had the same issue and found a semi-official piece of code that does exactly that, which may be useful to others with the same problem: https://bitbucket.org/pypa/wheel/src/cf4e2d98ecb1f168c50a6de496959b4a10c6b122/wheel/pep425tags.py?at=default&fileviewer=file-view-default#pep425tags.py-83:89

It comes from the wheel project, which needs to check whether Python was compiled with UCS-2 or UCS-4 because that changes the name of the generated binary file.

Answer 6

Another way is to create a Unicode array and look at its itemsize:

import array
bytes_per_char = array.array('u').itemsize

Quote from the array docs:

The 'u' typecode corresponds to Python's unicode character. On narrow Unicode builds this is 2-bytes, on wide builds this is 4-bytes.

Note that the distinction between narrow and wide Unicode builds was dropped in Python 3.3; see PEP 393. The 'u' typecode for array has been deprecated since 3.3 and, at the time of writing, was scheduled for removal in Python 4.0.
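The narrow/wide difference can also be observed directly, without array. This is a sketch with a made-up helper name, is_narrow_build. It relies on a narrow build storing any character above U+FFFF as a surrogate pair, so the string has length 2. The u'' literal works on Python 2 and on 3.3+, but not on 3.0 to 3.2.

```python
import sys

def is_narrow_build():
    """Return True if a non-BMP character occupies two code units."""
    # U+1F600 lies outside the Basic Multilingual Plane, so a narrow
    # build stores it as a surrogate pair and reports a length of 2.
    return len(u'\U0001F600') == 2

# The behavioural check should agree with sys.maxunicode.
agrees = is_narrow_build() == (sys.maxunicode == 0xFFFF)
print(is_narrow_build())
```

On Python 3.3 and later this should always report False, since every string is indexed by code point regardless of how it is stored.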

Answer 7

65535 is UCS-2:

Thus code point U+0000 is encoded as the number 0, and U+FFFF is encoded as 65535 (which is FFFF₁₆ in hexadecimal).
