
C Unicode: How do I apply C11 standard amendment DR488 fix to C11 standard function c16rtomb()?

Question:

The cppreference page for the C function c16rtomb says the following in its Notes section:

In C11 as published, unlike mbrtoc16, which converts variable-width multibyte (such as UTF-8) to variable-width 16-bit (such as UTF-16) encoding, this function can only convert single-unit 16-bit encoding, meaning it cannot convert UTF-16 to UTF-8 despite that being the original intent of this function. This was corrected by the post-C11 defect report DR488.

Below this passage, the page provides example source code, introduced by the following sentence:

Note: this example assumes the fix for the defect report 488 is applied.

That sentence implies there is a way to take DR488 and somehow "apply" the fix to the C11 standard function c16rtomb.

I would like to know how to apply the fix for GCC, because it seems to me the fix has already been applied in Visual Studio 2017 Visual C++, as of v141.

The behavior seen in GCC, when debugging the code in GDB, is consistent with what was found in DR488, as follows:

Section 7.28.1 describes the function c16rtomb(). In particular, it states "When c16 is not a valid wide character, an encoding error occurs". "wide character" is defined in section 3.7.3 as "value representable by an object of type wchar_t, capable of representing any character in the current locale". This wording seems to imply that, eg for the common cases (eg, an implementation that defines __STDC_UTF_16__ and a program that uses an UTF-8 locale), c16rtomb() will return -1 when it encounters a character that is encoded as multiple char16_t (for UTF-16 a wide character can be encoded as a surrogate pair consisting of two char16_t). In particular, c16rtomb() will not be able to process strings generated by mbrtoc16().

The behavior I see is the one this passage describes: c16rtomb() returns -1.

Source code:

#include <stdio.h>
#include <uchar.h>

#define __STD_UTF_16__

int main() {
    char16_t* ptr_string = (char16_t*) u"我是誰";

    //C++ disallows variable-length arrays. 
    //GCC uses GNUC++, which has a C++ extension for variable length arrays.
    //It is not a truly standard feature in C++ pedantic mode at all.
    //https://stackoverflow.com/questions/40633344/variable-length-arrays-in-c14
    char buffer[64];
    char* bufferOut = buffer;

    //Must zero this object before attempting to use mbstate_t at all.
    mbstate_t multiByteState = {0};

    //c16 = 16-bit Characters or char16_t typed characters
    //r = representation
    //tomb = to Multi-Byte Strings
    while (*ptr_string) {
        char16_t character = *ptr_string;
        size_t size = c16rtomb(bufferOut, character, &multiByteState);
        if (size == (size_t) -1)
            break;
        bufferOut += size;
        ptr_string++;
    }

    size_t bufferOutSize = bufferOut - buffer;
    printf("Size: %zu - ", bufferOutSize);
    for (size_t i = 0; i < bufferOutSize; i++) {
        printf("%#x ", +(unsigned char) buffer[i]);
    }

    //This statement is used to set a breakpoint. It does not do anything else.
    int debug = 0;
    return 0;
}

Output from Visual Studio:

Size: 9 - 0xe6 0x88 0x91 0xe6 0x98 0xaf 0xe8 0xaa 0xb0

Output from GCC:

Size: 0 -

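Answer:

On Linux you should be able to fix this with a call to setlocale(LC_ALL, "en_US.utf8"); before converting. A C program starts in the "C" locale, and c16rtomb converts to the multibyte encoding of the current locale, so without that call the non-ASCII characters cannot be encoded.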

Example on ideone

As the Microsoft documentation states, c16rtomb will:

Convert a UTF-16 wide character into a multibyte character in the current locale.

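The POSIX documentation is similar. Defining __STD_UTF_16__ doesn't seem to have an effect in either compiler. The standard macro is spelled __STDC_UTF_16__, and it is predefined by the implementation rather than by the program; it indicates that char16_t values are UTF-16 encoded. It says nothing about the encoding of the destination, which is determined by the current locale.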

The Windows documentation seems the more inconsistent of the two, because it implies either that setlocale is necessary or that converting to the ANSI code page is an option.
