
How to convert a C string (char array) into a Python string when there are non-ASCII characters in the string?

I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?

The Python string should in general be of type unicode; for instance, a 0x93 in Windows-1252 encoded input becomes u'\u201c'.

I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string;

     Py_Initialize();

     py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     return 0;
}

The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.
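
As a quick check of that diagnosis (a minimal sketch, not part of the original question), Python 2's PyUnicode_GetDefaultEncoding() can be used from C to print the interpreter's default encoding; on a stock build it reports ascii, which matches the error above:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     Py_Initialize();
     /* Print the default encoding the interpreter falls back to when it
        has to convert between str and unicode implicitly. */
     printf("Default encoding: %s\n", PyUnicode_GetDefaultEncoding());
     Py_Finalize();
     return 0;
}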

The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *raw, *decoded;

     Py_Initialize();

     raw = PyString_FromString(c_string);
     printf("Undecoded: ");
     PyObject_Print(raw, stdout, 0);
     printf("\n");
     decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
     Py_DECREF(raw);
     printf("Decoded: ");
     PyObject_Print(decoded, stdout, 0);
     printf("\n");
     return 0;
}

PyString_Decode does this:

PyObject *PyString_Decode(const char *s,
              Py_ssize_t size,
              const char *encoding,
              const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}

In other words, it does basically what you're doing in your second example: it converts to a string, then decodes the string. The problem here arises from PyString_AsDecodedString, rather than PyString_AsDecodedObject. PyString_AsDecodedString does PyString_AsDecodedObject, but then tries to convert the resulting unicode object back into a string object with the default encoding (which for you, it looks like, is ASCII). That's where it fails.

I believe you'll need to do two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_string, *py_unicode;

     Py_Initialize();

     py_string = PyString_FromStringAndSize(c_string, 1);
     if (!py_string) {
          PyErr_Print();
          return 1;
     }
     py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
     Py_DECREF(py_string);
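     /* Not part of the original answer: a small hedged follow-up that
        checks the result and prints it, so the example can be verified. */
     if (!py_unicode) {
          PyErr_Print();
          return 1;
     }
     printf("Decoded: ");
     PyObject_Print(py_unicode, stdout, 0);   /* expected output: u'\u201c' */
     printf("\n");
     Py_DECREF(py_unicode);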

     return 0;
}

I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.

You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?

Just use PyString_FromString:

char *cstring;
PyObject *pystring = PyString_FromString(cstring);

That's all. Now you have a Python str() object. See the docs here: https://docs.python.org/2/c-api/string.html

I'm a little bit confused about whether you want "str" or "unicode." They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_Decode is a good place to start.
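
If a unicode object is what you're after and you'd rather skip the intermediate str object entirely, here is a minimal sketch (assuming Python 2's PyUnicode_Decode, which takes the raw bytes, their length, an encoding name, and an error handler) that decodes the bytes directly:

#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
     char c_string[] = { (char)0x93, 0 };
     PyObject *py_unicode;

     Py_Initialize();

     /* Decode the raw bytes straight into a unicode object. */
     py_unicode = PyUnicode_Decode(c_string, 1, "windows_1252", "replace");
     if (!py_unicode) {
          PyErr_Print();
          return 1;
     }
     PyObject_Print(py_unicode, stdout, 0);   /* prints u'\u201c' */
     printf("\n");
     Py_DECREF(py_unicode);
     Py_Finalize();
     return 0;
}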

Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.
