Python to C++ Character encoding

I have a C++ program that uses the Python C/API to call Python scripts for DB info, but the data received is not encoded in the right way. This is in France, so my data has accents and other non-English characters.

In a Python terminal with the default encoding set to "utf-8" (via sys.setdefaultencoding), for example:

    >>> robin = 'testé'
    >>> robin
    'test\x82'
    >>> print robin
    testé
    >>> str(robin)
    'test\x82'

If I call:

    PyString_AsString(PyObject_Repr(PyObject_GetAttrString(/*PyObject of my Py_Init*/, "robin")));

I get a char* containing the following: test\\x82

And creating a string or wstring from that yields the same result.

I would like to be able to create a string that says "testé", and I'm guessing that starts with being able to output the variable correctly in the Python terminal, as in:

    >>> robin = 'testé'
    >>> robin
    'testé'

I tried encode(), decode(), sys.setdefaultencoding, sys.stdout.encoding, and even force_text and force_bytes from Django. Nothing seems to get me a standard C++ string with my actual characters in it. Any help would be greatly appreciated.

FYI - Python 2.7, Windows 8 x64, VS2012 and C++9

EDIT in response to comments:

    >>> import sys
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding('utf-8')
    >>> sys.getdefaultencoding()
    'utf-8'
    >>> robin = 'testé'
    >>> robin
    'test\x82'
    >>> print robin
    testé

I just want whatever 'print' does to display the information correctly...

This is not as simple as it seems. (I was wrong earlier: é in UTF-8 is the two bytes c3 a9.) Working with encodings in the Python interpreter from a console is hard, and there are several things you have to get right.

First, your console's default code page (encoding). You can check it by running the chcp command. Mine says 437, but it depends heavily on your Windows installation.

The code page for latin-1 is 28591 and the code page for utf-8 is 65001. Oddly enough, it is complicated to use the Python interpreter when the console is on code page 65001; it seems that it has not been declared as a synonym for utf-8 in Python's encoding libraries.

My point is that you have to keep this model in mind: if your console is on code page X, your input to the Python interpreter will be encoded in X, and you will see the output the way X interprets the bytes.
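As a small illustration of that point, the sketch below decodes the bytes from the question under two different code pages. It assumes the asker's console is on cp437, which the question does not state (cp850, common on French Windows installations, also maps byte 0x82 to é). The variable names are only illustrative.

```python
# Bytes the interpreter receives when "teste" with an acute e is typed
# in a console that is (by assumption) on code page 437.
raw = b'test\x82'

# Interpreted with the console's own code page, 0x82 is U+00E9 (e acute).
as_cp437 = raw.decode('cp437')

# Interpreted as latin-1, the same byte is U+0082, a control character.
as_latin1 = raw.decode('latin-1')

print(repr(as_cp437))
print(repr(as_latin1))
```

The byte string is identical in both cases; only the code page used to interpret it changes which character you end up with.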

I suggest you use unicode objects instead of hard-coded encoded strings in Python, and use escape sequences instead of literal characters. For example, you can declare robin like this:

    robin = u'test\xe9'

U+00E9 is the code point for é. After that, robin is a unicode object and can be encoded into any encoding you want, like this: robin.encode('utf-8'). This gives you control over the variable, so you can encode it appropriately for every output scenario.
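A minimal sketch of what that looks like, encoding the same unicode object for three targets. The choice of target encodings here is only for illustration; use whatever your output actually expects.

```python
robin = u'test\xe9'   # unicode text; U+00E9 written as an escape

# One unicode object, three different byte representations.
utf8_bytes = robin.encode('utf-8')      # e acute becomes c3 a9
latin1_bytes = robin.encode('latin-1')  # e acute becomes e9
cp437_bytes = robin.encode('cp437')     # e acute becomes 82

for name, data in [('utf-8', utf8_bytes),
                   ('latin-1', latin1_bytes),
                   ('cp437', cp437_bytes)]:
    print(name, repr(data))
```

Decoding any of these byte strings with the matching codec gives back the original unicode object.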

To sum up:

  1. Figure out your console's encoding
  2. Encode the robin variable according to this encoding
  3. The console should then display it correctly
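The three steps above could be sketched roughly as follows. The helper name encode_for_console is hypothetical, and falling back to utf-8 when sys.stdout reports no encoding is an assumption made for this sketch, not something the interpreter does for you.

```python
import sys

def encode_for_console(text, encoding=None):
    """Encode unicode text for the console (hypothetical helper)."""
    # Step 1: figure out the console's encoding. sys.stdout.encoding can
    # be None (e.g. when output is piped), so fall back to utf-8 here.
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # Step 2: encode for that code page; characters it cannot represent
    # are replaced with '?' instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')

robin = u'test\xe9'
console_bytes = encode_for_console(robin)

# Step 3: these are the bytes the console should display correctly.
print(repr(console_bytes))
```

Passing the encoding explicitly, for example encode_for_console(robin, 'cp437'), is useful when the bytes are destined for something other than the current console.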

Hope this is helpful!

You call PyObject_Repr, which is the same as repr(robin) in Python and produces the literal characters \x82 (backslash, x, 8, 2) instead of the single byte 0x82. Leave it out of your chain of calls.
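The difference is easy to see from Python itself. This is a sketch of the Python-side equivalent, not the C++ code: repr() here stands in for PyObject_Repr, and the cp437 source encoding is an assumption about the asker's console.

```python
raw = b'test\x82'     # 5 bytes; the last one is the single byte 0x82

# repr() is what PyObject_Repr computes: quotes are added and the
# non-ASCII byte is spelled out as a backslash escape.
escaped = repr(raw)

# Without repr, the original bytes are intact and can be converted to
# UTF-8 once the source encoding is known (assumed cp437 here).
utf8_bytes = raw.decode('cp437').encode('utf-8')

print(escaped)
print(repr(utf8_bytes))
```

On the C++ side this corresponds to calling PyString_AsString directly on the attribute, which returns the five raw bytes instead of the escaped text.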
