
Passing Unicode strings to printf via ctypes

I'm testing the built-in ctypes module for Python 3.x before devoting time to writing a wrapper for my existing C library.

I know that the C standard library functions expect ASCII input for anything labeled char * in the manual. However, my library is UTF-8 compliant, and I have tested it in C programs. I have also verified that the following code is valid C11 and works as expected:

printf("Hello, %s!\n", u8"world");

However, if I try the same in Python, only the first character in my string is printed.

from ctypes import *

libc = CDLL("libc.so.6")

libc.printf(b"Hello, %s!\n", "world") # will print: Hello, w!

The Python 3 Unicode documentation implies that Python 3 uses UTF-8 as its character encoding, which should avoid the embedded NUL bytes that would make printf stop reading. If I change the %s in my Python test to %ls, it prints as expected.

Is Python actually using UTF-16?

Python 3 (before 3.3) stores strings internally as either UCS-2 or UCS-4, per the docs:

Strings are stored internally as sequences of codepoints (to be precise as Py_UNICODE arrays). Depending on the way Python is compiled (either via --without-wide-unicode or --with-wide-unicode, with the former being the default) Py_UNICODE is either a 16-bit or 32-bit data type.

Py_UNICODE

This type represents the storage type which is used by Python internally as basis for holding Unicode ordinals. Python's default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4.

What is happening with this line:

libc.printf(b"Hello, %s!\n", "world") # will print: Hello, w!

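is that ctypes marshals byte strings as char* and Unicode strings as wchar_t* (UTF-16 or UTF-32, depending on the OS), regardless of what Python uses internally. With %s, printf reads the wchar_t buffer as if it were a char*: on a little-endian machine the first byte is 'w' and the next byte is zero, so printf treats that as the end of the string. Note that for printf, %s expects char* and %ls expects wchar_t*. I'm on Windows, so I'll use cdll.msvcrt: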

from ctypes import *
cdll.msvcrt.printf(b'Hello, %s!\n', b'world') # byte string
cdll.msvcrt.printf(b'Hello, %ls!\n', 'world')  # Unicode string (UTF-16 or UTF-32)
cdll.msvcrt.printf(b'Hello, %s!\n', 'world')   # incorrect!

Output:

Hello, world!
Hello, world!
Hello, w!
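
The truncated third line follows from the memory layout of the wchar_t buffer. Here is a minimal sketch that inspects the kind of bytes printf would receive. It queries the wchar_t width rather than assuming it. The variable names are illustrative only, and the "w" result assumes a little-endian machine:

```python
import ctypes
import sys

# create_unicode_buffer allocates a NUL-terminated wchar_t array,
# comparable to what ctypes builds when a str is passed to a C function.
buf = ctypes.create_unicode_buffer("world")
raw = bytes(buf)  # raw memory of the array, including the terminator

width = ctypes.sizeof(ctypes.c_wchar)  # 2 on Windows, typically 4 on Linux
print("wchar_t width:", width)
print("raw bytes:", raw)

# A char*-based conversion such as printf's %s stops at the first zero byte.
seen_by_percent_s = raw.split(b"\x00", 1)[0]
print("what %s sees:", seen_by_percent_s)
```

On a little-endian machine, every ASCII character in the buffer is followed by at least one zero byte, so %s never gets past the first character.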

Simply use a byte string for %s in your example:

libc.printf(b"Hello, %s!\n", b"world")
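
If you want the mismatch to fail in Python rather than in the output, one option is to wrap arguments in the explicit ctypes pointer types. This is a small sketch of that idea, not something printf requires; the variable names are illustrative:

```python
import ctypes

# Wrapping arguments explicitly documents the intended C type.
narrow = ctypes.c_char_p(b"world")   # char*, suitable for %s
wide = ctypes.c_wchar_p("world")     # wchar_t*, suitable for %ls
print(narrow.value, wide.value)

# Mixing them up raises in Python instead of printing a truncated string.
try:
    ctypes.c_char_p("world")         # str is not accepted as char*
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
print("str rejected by c_char_p:", mismatch_rejected)
```

The wrapped values can then be passed to printf in place of the bare literals.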

You can do your own explicit encoding if you want UTF-8:

#coding:utf8
from ctypes import *
cdll.msvcrt.printf(b'Hello, %s!\n', 'αßΓπΣσµτΦ'.encode('utf8'))

Output (after switching the Windows console to the UTF-8 code page with chcp 65001):

Hello, αßΓπΣσµτΦ!
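
For a wrapper around a UTF-8 library, the encoding step can live in one place. The sketch below uses a hypothetical helper, to_utf8_arg, which is not part of ctypes. It assumes the C side expects NUL-terminated UTF-8, so it rejects embedded NUL characters that C would silently truncate:

```python
import ctypes

def to_utf8_arg(text):
    """Hypothetical helper: encode a str as UTF-8 for a char* parameter."""
    data = text.encode("utf-8")
    # C strings end at the first NUL byte, so reject input that would be
    # silently truncated on the C side.
    if b"\x00" in data:
        raise ValueError("embedded NUL cannot be passed as a C string")
    return ctypes.c_char_p(data)

text = "αßΓπΣσµτΦ"
arg = to_utf8_arg(text)
print(arg.value)        # the UTF-8 bytes that C would read
print(len(arg.value))   # byte count, not character count
```

The returned object can be passed as the %s argument, for example libc.printf(b"Hello, %s!\n", to_utf8_arg("world")).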
