libxml2 xmlChar * to std::wstring

Question

libxml2 seems to store all its strings in UTF-8, as xmlChar * .

/**
 * xmlChar:
 *
 * This is a basic byte in an UTF-8 encoded string.
 * It's unsigned allowing to pinpoint case where char * are assigned
 * to xmlChar * (possibly making serialization back impossible).
 */
typedef unsigned char xmlChar;

As libxml2 is a C library, there's no provided routines to get an std::wstring out of an xmlChar * . I'm wondering whether the prudent way to convert xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):

std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if(!xmlString){abort();} //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}

Edit: Just an FYI, I'm very aware of what xmlStrlen returns; it's the number of xmlChar used to store the string; I know it's not the number of characters but rather the number of unsigned char . It would have been less confusing if I had named it byteLength , but I thought it would have been clearer as I have both charLength and wcharLength . As for the correctness of the code, the wideBuffer will be larger or equal to the required size to hold the buffer, always (I believe). As characters that require more space than wide_t will be truncated (I think).

Answer 1

xmlStrlen() returns the number of UTF-8 encoded codeunits in the xmlChar* string. That is not going to be the same number of wchar_t encoded codeunits needed when the data is converted, so do not use xmlStrlen() to allocate the size of your wchar_t string. You need to call std::mbtowc() once to get the correct length, then allocate the memory, and call mbtowc() again to fill the memory. You will also have to use std::setlocale() to tell mbtowc() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{    
    if (!xmlString) { abort(); } //provided string was null

    std::wstring wideString;

    int charLength = xmlStrlen(xmlString);
    if (charLength > 0)
    {
        char *origLocale = setlocale(LC_CTYPE, NULL);
        setlocale(LC_CTYPE, "en_US.UTF-8");

        size_t wcharLength = mbtowc(NULL, (const char*) xmlString, charLength); //excludes null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbtowc(&wideString[0], (const char*) xmlString, charLength);
        }

        setlocale(LC_CTYPE, origLocale);
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }

    return wideString;
}

A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead so you do not have to deal with locales:

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{    
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString);
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed
    }
}

An alternative option is to use an actual Unicode library, such as ICU or ICONV, to handle Unicode conversions.

Answer 2

There are some problems in this code, besides the fact that you are using wchar_t and std::wstring which is a bad idea unless you're making calls to the Windows API.

xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (aka bytes) in a string. It does not count the number of characters. This is all stuff in the documentation .
Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.
The mbtowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.

My recommendations:

Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).
If you need UTF-32, then use std::u32string . Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or fixed-length (Linux, OS X).

If you absolutely must have wchar_t , then chances are good that you are on Windows. Here is how you do it on Windows:

 std::wstring utf8_to_wstring(const char *utf8) { size_t utf8len = std::strlen(utf8); int wclen = MultiByteToWideChar( CP_UTF8, 0, utf8, utf8len, NULL, 0); wchar_t *wc = NULL; try { wc = new wchar_t[wclen]; MultiByteToWideChar( CP_UTF8, 0, utf8, utf8len, wc, wclen); std::wstring wstr(wc, wclen); delete[] wc; wc = NULL; return wstr; } catch (std::exception &) { if (wc) delete[] wc; } }

If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv , man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv() .

Remember: You probably don't want wchar_t or std::wstring . What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.

libxml2 xmlChar * to std::wstring

Question

2 answers

solution1
6 ACCPTED 2013-01-01 02:13:02

solution2
2 2013-01-01 02:04:24

libxml2 xmlChar * to std::wstring

Question

2 answers

solution1 6 ACCPTED 2013-01-01 02:13:02

solution2 2 2013-01-01 02:04:24

solution1
6 ACCPTED 2013-01-01 02:13:02

solution2
2 2013-01-01 02:04:24