I'm testing out the built-in ctypes module for Python 3.x before devoting some time to make a wrapper for my existing C library.
I know that the stdlib functions in C will want ASCII input for anything labeled char * in the manual. However, my library is UTF-8 compliant, and I have tested it in C programs. I have also verified that the following code, compiled as C11, is valid and works as expected:
printf("Hello, %s!\n", u8"world");
However, if I try the same in Python, only the first character in my string is printed.
from ctypes import *
libc = CDLL("libc.so.6")
libc.printf(b"Hello, %s!\n", "world") # will print: Hello, w!
The Python 3 manual about Unicode implies that Python 3 uses UTF-8 as its character encoding, which should avoid embedded NUL bytes that printf would see and stop reading. If I change the %s in my Python test to %ls, it prints as expected.
Is Python actually using UTF-16?
Python 3 (before 3.3) uses either UCS-2 or UCS-4 internally, per the docs:
Strings are stored internally as sequences of codepoints (to be precise as Py_UNICODE arrays). Depending on the way Python is compiled (either via --without-wide-unicode or --with-wide-unicode, with the former being the default) Py_UNICODE is either a 16-bit or 32-bit data type.
Py_UNICODE
This type represents the storage type which is used by Python internally as basis for holding Unicode ordinals. Python's default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4.
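If you want to check which kind of build you are running, a minimal sketch is to look at sys.maxunicode. This relies only on the standard sys module; the variable name is_wide is just for illustration.

```python
import sys

# 0xFFFF on old narrow (UCS-2) builds; 0x10FFFF on wide (UCS-4) builds
# and on every Python from 3.3 onward.
is_wide = sys.maxunicode > 0xFFFF

print(hex(sys.maxunicode), is_wide)
```

Either way, this only describes Python's own storage, not what ctypes hands to C.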
What is happening with this line:
libc.printf(b"Hello, %s!\n", "world") # will print: Hello, w!
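is that ctypes is marshaling byte strings as char* and Unicode strings as wchar_t* (UTF-16 or UTF-32, depending on OS). It doesn't really matter what Python is using internally. Read as char*, the wide string starts with the byte for 'w' followed by a zero byte (on a little-endian machine), so %s stops after one character. I'm on Windows, so I'll use cdll.msvcrt, but note that %s expects char* and %ls expects wchar_t* for printf: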
from ctypes import *
cdll.msvcrt.printf(b'Hello, %s!\n', b'world') # byte string
cdll.msvcrt.printf(b'Hello, %ls!\n', 'world') # Unicode string (UTF-16 or UTF-32)
cdll.msvcrt.printf(b'Hello, %s!\n', 'world') # incorrect!
Output:
Hello, world!
Hello, world!
Hello, w!
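To see why the third call stops after one character, here is a rough sketch that dumps the raw bytes of a wide-character buffer. It assumes that create_unicode_buffer lays the text out the same way ctypes does when it passes a str argument; the variable names are only illustrative.

```python
import ctypes

# Build a wchar_t buffer holding "world" plus a terminating wide NUL.
wide = ctypes.create_unicode_buffer("world")

# Copy out the raw memory so it can be inspected byte by byte.
raw = ctypes.string_at(ctypes.addressof(wide), ctypes.sizeof(wide))

# 2 on Windows, typically 4 on Linux and macOS.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# A char*-based reader such as %s stops at the first zero byte.
seen_by_percent_s = raw.split(b"\x00", 1)[0]

print(wchar_size, raw, seen_by_percent_s)
```

On a little-endian machine the last value should be just the single byte for 'w', which matches the truncated output above.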
Simply use a byte string for %s in your example:
libc.printf(b"Hello, %s!\n", b"world")
You can do your own explicit encoding if you want UTF-8:
#coding:utf8
from ctypes import *
cdll.msvcrt.printf(b'Hello, %s!\n', 'αßΓπΣσµτΦ'.encode('utf8'))
Output (after changing the Windows console via chcp 65001, the UTF-8 code page):
Hello, αßΓπΣσµτΦ!
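If you are wrapping a whole library, you may prefer to do the encoding in one place. The following is only a sketch: as_c_string is a hypothetical helper name, not part of ctypes, and it assumes your C functions take NUL-terminated UTF-8 char* arguments.

```python
import ctypes


def as_c_string(text, encoding="utf-8"):
    """Hypothetical helper: turn str or bytes into a c_char_p for a char* parameter."""
    data = text.encode(encoding) if isinstance(text, str) else bytes(text)
    # C would treat an embedded zero byte as the end of the string.
    if b"\x00" in data:
        raise ValueError("embedded NUL byte would truncate the C string")
    return ctypes.c_char_p(data)


arg = as_c_string("αßΓπΣσµτΦ")
print(arg.value)  # the UTF-8 encoded bytes that C will see
```

The result can then be passed wherever %s or a char* parameter is expected, for example libc.printf(b"Hello, %s!\n", as_c_string("world")).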