Why is a Unicode character stored as UTF-8 in std::string, but as UTF-16/32 in wchar_t?

I have a small piece of code:

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string>

wchar_t widec('€');
wchar_t widecl(L'€');
std::string tc("€");

int main(int argc, char *argv[])
{
    printf("printf as hex - std::string tc(\"€\") = %x %x %x\n\r", tc.c_str()[0], tc.c_str()[1], tc.c_str()[2]);
    printf("printf as hex - wchar_t widec('€') = %x\n\r", widec);
    printf("printf as hex - wchar_t widecl(L'€') = %x\n\r", widecl);

    return 0;
}

That outputs:

printf as hex - std::string tc("€") = ffffffe2 ffffff82 ffffffac
printf as hex - wchar_t widec('€') = e282ac
printf as hex - wchar_t widecl(L'€') = 20ac

I don't understand two things.

  1. Why is tc.c_str() (its [0], [1] and [2] indexes, to be exact) printed as UTF-8 that looks like UTF-16/32, with leading FF bytes?

  2. Why does initializing the same wchar_t variable give different output depending on whether the L prefix is used? With the prefix it seems to produce UTF-16/32 content, and without it UTF-8. Why is that?

  1. A char without an explicit sign specifier is either signed or unsigned, depending on the compiler. The standard does not dictate the default; it is the compiler vendor's choice.

    Passing a char to printf() promotes the value from 8 bits to 32 bits (the default argument promotions extend it to int). Then %x prints the bits of that 32-bit value, omitting leading zeros by default (unless you use a zero-padded field width such as %08x to preserve them). How the 8-bit value is extended to 32 bits depends on its actual type.

    In your case, the extra leading f digits you see are due to the char values being sign-extended. The high bit of 0xE2, 0x82, and 0xAC is 1, so 1 is used to fill in the high 24 bits during the extension. This means your compiler implements char as a signed type and is extending the values to signed int. You can manually cast the char values to unsigned char to force them to be zero-extended instead:

     printf("printf as hex - std::string tc(\\"€\\") = %x %x %x\\n", (unsigned char) tc[0], (unsigned char) tc[1], (unsigned char) tc[2]); 

    (Note that I removed the use of c_str(); it is unnecessary in your example.)
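
    Here is a minimal, self-contained sketch of the same effect, assuming char is signed on your compiler (as it evidently is here) and spelling the euro sign's UTF-8 bytes out explicitly so the result does not depend on the source file's encoding:

        #include <cstdio>
        #include <string>

        int main()
        {
            std::string tc("\xE2\x82\xAC"); // the euro sign's UTF-8 bytes, written out explicitly

            // char is signed here, so the value is sign-extended to int by the
            // default argument promotions before %x prints it:
            printf("%x\n", tc[0]);                  // ffffffe2

            // casting to unsigned char forces zero extension instead:
            printf("%x\n", (unsigned char) tc[0]);  // e2

            // the hh length modifier tells printf to convert the promoted int
            // back to unsigned char before printing:
            printf("%hhx\n", tc[0]);                // e2

            return 0;
        }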

  2. The interpretation of '€' and "€" without any prefixes is subject to the encoding that your source file is saved as, and the encoding that the compiler is configured to operate in.

    The only way the un-prefixed '€' and "€" literals could be in UTF-8 is if your source code file is saved in UTF-8 (to force UTF-8 literals, you can use the u8 prefix in C++11 and later). Save the file in a different encoding, and you will see different results. The result of that interpretation is then assigned as-is to tc, and encoded as-is as a wchar_t in widec.

    The L prefix, on the other hand, forces the compiler to interpret L'€' as a wide literal instead of a narrow literal, so there is no question of how it should be encoded. The compiler knows the literal is Unicode, so it determines the Unicode codepoint value and then encodes it as a wchar_t value (wchar_t is 16-bit on Windows, and 32-bit on most other platforms) in widecl. The Unicode codepoint of '€' is U+20AC EURO SIGN.
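
    A short sketch of the difference, using the \u20AC universal character name so the example does not depend on how the source file is saved (the byte values in the comments assume a UTF-8 narrow execution charset and a 32-bit wchar_t, as on a typical Linux toolchain):

        #include <cstdio>
        #include <cstring>

        int main()
        {
            // u8"" forces UTF-8 regardless of the execution charset (C++11;
            // note that in C++20 the literal's element type changes to char8_t):
            const char *u8s = u8"\u20AC";
            for (size_t i = 0; i < strlen(u8s); ++i)
                printf("%hhx ", u8s[i]);                    // e2 82 ac
            printf("\n");

            // L'' yields the codepoint encoded in the wide execution charset:
            wchar_t w = L'\u20AC';
            printf("%x, sizeof(wchar_t) = %zu\n",
                   (unsigned) w, sizeof(wchar_t));          // 20ac, 4 (2 on Windows)

            // char32_t (C++11) always holds the raw Unicode codepoint as UTF-32:
            char32_t c = U'\u20AC';
            printf("%x\n", (unsigned) c);                   // 20ac

            return 0;
        }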
