
libxml2 xmlChar * to std::wstring

libxml2 seems to store all its strings in UTF-8, as xmlChar *:

/**
 * xmlChar:
 *
 * This is a basic byte in an UTF-8 encoded string.
 * It's unsigned allowing to pinpoint case where char * are assigned
 * to xmlChar * (possibly making serialization back impossible).
 */
typedef unsigned char xmlChar;

As libxml2 is a C library, it provides no routines to get a std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert an xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):

std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if(!xmlString){abort();} //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}

Edit: Just an FYI, I'm well aware of what xmlStrlen returns: the number of xmlChar units used to store the string. I know it's not the number of characters but rather the number of unsigned char. It would have been less confusing if I had named it byteLength, but I thought charLength would be clearer next to wcharLength. As for the correctness of the code, wideBuffer will always be at least as large as the converted string requires (I believe), and characters that require more space than a wchar_t will be truncated (I think).

xmlStrlen() returns the number of UTF-8 code units (bytes) in the xmlChar * string. That is generally not the number of wchar_t code units needed once the data is converted, so do not use xmlStrlen() to size your wchar_t string. Instead, call std::mbstowcs() once with a null destination to get the required length (a POSIX extension that the Microsoft C runtime also supports), allocate the memory, and call mbstowcs() again to fill it. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:

// needs <clocale>, <cstdlib>, <string> and <libxml/xmlstring.h>
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null

    std::wstring wideString;

    if (xmlStrlen(xmlString) > 0)
    {
        // Copy the current locale name: the pointer returned by setlocale()
        // may be invalidated by the next setlocale() call.
        const char *current = setlocale(LC_CTYPE, NULL);
        std::string origLocale = current ? current : "C";
        if (!setlocale(LC_CTYPE, "en_US.UTF-8")) { abort(); } //UTF-8 locale not available

        const char *src = (const char*) xmlString;
        size_t wcharLength = mbstowcs(NULL, src, 0); //excludes null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], src, wcharLength);
        }

        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }

    return wideString;
}

A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert, so you do not have to deal with locales (note that both were deprecated in C++17):

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{    
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString);
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed
    }
}

An alternative is to use an actual Unicode library, such as ICU or iconv, to handle Unicode conversions.
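As a rough illustration of the library route, here is a minimal sketch built on POSIX iconv(). It assumes an iconv implementation that accepts the encoding name "WCHAR_T" (glibc does) and that declares iconv() with the POSIX `char **` input parameter. The function name utf8ToWideIconv is made up for this example.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>
#include <iconv.h>

// Hypothetical helper: convert NUL-terminated UTF-8 to std::wstring via iconv.
std::wstring utf8ToWideIconv(const char *utf8)
{
    assert(utf8 != nullptr);

    // Argument order is (to, from). "WCHAR_T" is an assumption about the
    // platform's iconv; it is not guaranteed everywhere.
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    size_t inLeft = std::strlen(utf8);
    // One wchar_t per input byte is always enough room, for both UTF-16
    // and UTF-32 flavours of wchar_t.
    std::wstring out(inLeft, L'\0');

    char *in = const_cast<char *>(utf8);   // iconv() does not modify the input bytes
    char *dst = reinterpret_cast<char *>(&out[0]);
    size_t outLeft = out.size() * sizeof(wchar_t);

    size_t rc = iconv(cd, &in, &inLeft, &dst, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) throw std::range_error("iconv conversion failed");

    // Trim the unused tail of the buffer.
    out.resize(out.size() - outLeft / sizeof(wchar_t));
    return out;
}
```

Since xmlChar is unsigned char, an xmlChar * would need a cast to const char * before being passed in. On platforms where iconv lives in a separate library, linking with -liconv may be required.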

There are some problems in this code, besides the fact that you are using wchar_t and std::wstring, which is a bad idea unless you're making calls to the Windows API.

  1. xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (i.e. bytes) in a string, not the number of characters. This is all in the documentation.

  2. Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.

  3. The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!

  4. This code will leak memory if the std::wstring constructor throws an exception.
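To make points 1 and 2 concrete, here is a small sketch that measures one UTF-8 string three ways. The names Utf8Counts and countUtf8 are invented for this illustration, and the code assumes the input is already valid UTF-8 (it performs no validation).

```cpp
#include <cassert>
#include <cstddef>

// Sizes of one UTF-8 string measured three ways (hypothetical helper type).
struct Utf8Counts {
    std::size_t bytes;       // UTF-8 code units, which is what xmlStrlen() reports
    std::size_t codePoints;  // wchar_t units needed where wchar_t holds UTF-32
    std::size_t utf16Units;  // wchar_t units needed where wchar_t holds UTF-16
};

// Assumes `utf8` is valid, NUL-terminated UTF-8.
Utf8Counts countUtf8(const unsigned char *utf8)
{
    assert(utf8 != nullptr);
    Utf8Counts n = {0, 0, 0};
    for (; *utf8; ++utf8) {
        ++n.bytes;
        if ((*utf8 & 0xC0) != 0x80) {   // not a continuation byte: a code point starts here
            ++n.codePoints;
            // 4-byte sequences (lead byte 0xF0 and up) encode code points above
            // U+FFFF, which need a surrogate pair in UTF-16.
            n.utf16Units += (*utf8 >= 0xF0) ? 2 : 1;
        }
    }
    return n;
}
```

For the string "aé€😀" this gives 10 bytes, 4 code points and 5 UTF-16 code units. The byte count is an upper bound on the other two, so a buffer sized from it is never too small, but it is not the length of the converted string either.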

My recommendations:

  1. Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).

  2. If you need UTF-32, then use std::u32string. Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or fixed-length (Linux, OS X).

  3. If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:

     std::wstring utf8_to_wstring(const char *utf8)
     {
         // needs <windows.h>, <cstring> and <string>
         int utf8len = (int)std::strlen(utf8);
         if (utf8len == 0) return std::wstring(); // MultiByteToWideChar rejects a zero length
         // First call: query the number of wchar_t units needed.
         int wclen = MultiByteToWideChar(CP_UTF8, 0, utf8, utf8len, NULL, 0);
         wchar_t *wc = NULL;
         try {
             wc = new wchar_t[wclen];
             // Second call: perform the conversion.
             MultiByteToWideChar(CP_UTF8, 0, utf8, utf8len, wc, wclen);
             std::wstring wstr(wc, wclen);
             delete[] wc;
             return wstr;
         } catch (...) {
             delete[] wc; // deleting NULL is a no-op
             throw;       // rethrow instead of falling off the end of the function
         }
     }

  4. If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
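For recommendation 2 in the list above, a hand-rolled decoder is one way to get a std::u32string without touching locales or wchar_t. The following is only a sketch: the name utf8ToU32String is hypothetical, and it takes const unsigned char * so that an xmlChar * can be passed directly.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Decode a NUL-terminated UTF-8 byte string into UTF-32 code points.
// Throws std::range_error on malformed input (hypothetical helper).
std::u32string utf8ToU32String(const unsigned char *utf8)
{
    assert(utf8 != nullptr);
    std::u32string out;
    while (*utf8) {
        unsigned char lead = *utf8++;
        char32_t cp;        // code point being assembled
        int extra;          // number of continuation bytes expected
        char32_t minValue;  // smallest code point allowed for this sequence length
        if (lead < 0x80)                { cp = lead;        extra = 0; minValue = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; minValue = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; minValue = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; minValue = 0x10000; }
        else throw std::range_error("invalid UTF-8 lead byte");

        for (int i = 0; i < extra; ++i) {
            unsigned char c = *utf8;
            // The NUL terminator fails this check too, so we never read past the end.
            if ((c & 0xC0) != 0x80) throw std::range_error("truncated UTF-8 sequence");
            cp = (cp << 6) | (c & 0x3F);
            ++utf8;
        }
        // Reject overlong encodings, values beyond U+10FFFF, and surrogates.
        if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::range_error("overlong or out-of-range UTF-8 sequence");
        out.push_back(cp);
    }
    return out;
}
```

The result has one element per code point on every platform, which is the property std::wstring cannot promise.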

Remember: You probably don't want wchar_t or std::wstring. What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.
