
C++ Utf-8 conversion using atlconv.h / W2A and Chinese texts

I'm converting wchar_t* to UTF-8 like this:

char* DupString(wchar_t* t)
{ 
    if(!t) return strdup("");
    USES_CONVERSION;
    _acp = CP_UTF8;
    return strdup(W2A(t));
}

Normally it works fine, but I have now found a Chinese text, "主体", for which the conversion fails.

The macro itself is defined like this:

#define W2A(lpw) (\
    ((_lpw = lpw) == NULL) ? NULL : (\
        (_convert = (lstrlenW(_lpw)+1), \
        (_convert>INT_MAX/2) ? NULL : \
        ATLW2AHELPER((LPSTR) alloca(_convert*sizeof(WCHAR)), _lpw, _convert*sizeof(WCHAR), _acp))))

In my case _convert = 2 + 1 = 3, so the buffer size passed to the helper is 3 * sizeof(WCHAR) = 6 bytes.

In atlconv.h, AtlW2AHelper calls WideCharToMultiByte, which fails and returns 0 (ret == 0):

_Ret_opt_z_cap_(nChars) inline LPSTR WINAPI AtlW2AHelper(
    _Out_opt_z_cap_(nChars) LPSTR lpa, 
    _In_opt_z_ LPCWSTR lpw, 
    _In_ int nChars, 
    _In_ UINT acp) throw()
{
    ATLASSERT(lpw != NULL);
    ATLASSERT(lpa != NULL);
    if (lpa == NULL || lpw == NULL)
        return NULL;
    // verify that no illegal character present
    // since lpa was allocated based on the size of lpw
    // don't worry about the number of chars
    *lpa = '\0';
    int ret = WideCharToMultiByte(acp, 0, lpw, -1, lpa, nChars, NULL, NULL);
    if(ret == 0)
    {
        ATLASSERT(FALSE);
        return NULL;
    }
    return lpa;
}

@err in the Watch window shows error code 122 = ERROR_INSUFFICIENT_BUFFER.

I tried increasing the buffer by one byte (nChars = 7), and then the conversion succeeds. The buffer is filled with 6 bytes plus 1 terminating zero byte, so 7 bytes are written.

Is this a bug in the W2A macro, where the terminating zero is not taken into account?

Has anyone seen a similar problem?

I'm using Visual Studio 2010; I'm not sure whether the problem persists in other Visual Studio versions as well.

In some header files this issue seems to be fixed, for example here:

https://github.com/kxproject/kx-audio-driver/blob/master/h/gui/kDefs.h

But that is a third-party project, not Visual Studio itself.

W2A mistakenly assumes that a buffer of two bytes per character is sufficient for the conversion. Each of your two characters takes three bytes in UTF-8, so the string expands to seven bytes including the terminating zero, while W2A reserves only six. WideCharToMultiByte fails with ERROR_INSUFFICIENT_BUFFER - this is what you already found.
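To make the arithmetic concrete, here is a small portable sketch that counts the UTF-8 bytes a UTF-16 string needs and compares that with what the old W2A reserves. The function names Utf8BytesNeeded and LegacyW2ABudget are made up for this illustration and are not part of ATL; char16_t stands in for the 2-byte Windows WCHAR, and unpaired surrogates are assumed to be replaced by a 3-byte U+FFFD.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed to encode a NUL-terminated UTF-16
// string as UTF-8, including the terminating NUL.
std::size_t Utf8BytesNeeded(const char16_t* s)
{
    assert(s != nullptr);
    std::size_t bytes = 1; // terminating NUL
    for (; *s != 0; ++s) {
        const char16_t c = *s;
        if (c < 0x80) {
            bytes += 1; // ASCII
        } else if (c < 0x800) {
            bytes += 2; // two-byte sequence
        } else if (c >= 0xD800 && c <= 0xDBFF &&
                   s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
            bytes += 4; // surrogate pair: two code units, one 4-byte sequence
            ++s;        // skip the low surrogate
        } else {
            bytes += 3; // rest of the BMP (assumption: lone surrogate -> U+FFFD)
        }
    }
    return bytes;
}

// Hypothetical helper: bytes the old W2A macro reserves, that is
// (length + 1) * sizeof(WCHAR) with a 2-byte WCHAR.
std::size_t LegacyW2ABudget(const char16_t* s)
{
    assert(s != nullptr);
    std::size_t units = 0;
    while (s[units] != 0) {
        ++units;
    }
    return (units + 1) * 2;
}
```

For u"\u4E3B\u4F53" ("主体") the first function gives 7 and the second gives 6, which matches the failure described in the question. The worst case is 3 bytes per UTF-16 code unit, so a budget of 4 bytes per unit is always enough.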

It looks like a bug which you can fix yourself in the ATL source, in atlconv.h (Microsoft will not update VS 2010, and I suppose it might already be too late to update even 2015). Both the alloca size and the size passed to the helper have to grow together, otherwise WideCharToMultiByte is told about more room than was actually allocated:

#define W2A(lpw) (\
    ((_lpw = lpw) == NULL) ? NULL : (\
        (_convert = (static_cast<int>(wcslen(_lpw))+1), \
        (_convert>INT_MAX/4) ? NULL : \
        ATLW2AHELPER((LPSTR) alloca(_convert*4), _lpw, _convert*4, _acp)))) // was _convert*sizeof(WCHAR) in both places

Or you can use the newer CW2A conversion macros, which already allocate larger buffers (4 bytes per character, see CW2AEX::Init):

static const LPCWSTR g_psz = L"主体";
LPCSTR psz = _strdup(CW2A(g_psz, CP_UTF8));

Copied from the Microsoft forum thread here:

https://social.msdn.microsoft.com/Forums/en-US/262e7b83-8cf4-45ed-a3db-5dc6064612f2/c-utf8-conversion-using-atlconvh-w2a-and-chinese-texts?forum=vcgeneral&prof=required

Have you considered using the improved ATL7 macro? https://msdn.microsoft.com/en-us/library/87zae4a3.aspx#atl70stringconversionclassesmacros

 CW2A pA( pW, CP_UTF8 ); 

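This seems to assume at most 4 bytes per wide character, rather than the 2 that the old macro assumes.

It is also a slightly cleaner way to handle the string, because CW2A's destructor releases the conversion buffer. The flip side is that a pointer taken from it is valid only while the CW2A object is alive:

 char* pStr = NULL;
 {
     CW2A pA( pW, CP_UTF8 );

     pStr = pA;
     // pStr is valid here
 }
 // pStr is invalid here: pA's destructor has released the buffer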
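If the converted text has to outlive the converter object, one option is to copy it into an owning string while the converter is still alive. The sketch below is portable and only illustrates the lifetime rule: ScopedNarrow is a made-up stand-in for CW2A (it does no real conversion), and ConvertAndKeep is likewise a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for ATL's CW2A: it owns a buffer and exposes it
// as a raw pointer that is valid only while the object is alive.
class ScopedNarrow
{
public:
    explicit ScopedNarrow(const char* text) : buffer_(text) {}
    operator const char*() const { return buffer_.c_str(); }

private:
    std::string buffer_; // released when the object is destroyed
};

// Copies the text out while the converter is still in scope, so the
// returned std::string does not depend on the converter's buffer.
std::string ConvertAndKeep(const char* text)
{
    assert(text != nullptr);
    ScopedNarrow converted(text);
    const char* view = converted; // valid only inside this function
    return std::string(view);     // deep copy outlives `converted`
}
```

With the real CW2A the same idea would be to construct a std::string (or call _strdup, as in the earlier answer) from the CW2A object before it goes out of scope.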
