How do I properly initialize a wide string?

Question

I am trying to figure out wide characters in C. For example, I test a string that contains the single letter "Ē", which is encoded as the bytes C4 92 in UTF-8.

char* T1 = "Ē";
//This is the resulting array { 0xc4, 0x92, 0x00 }

wchar_t* T2 = L"Ē";
//This is the resulting array { 0x00c4, 0x2019, 0x0000 }

I expected that the second array would be {0xc492, 0x0000}; instead it contains an extra character that, in my opinion, just wastes space. Can anyone help me understand what is going on here?

Answer 1

What you've managed to produce here is mojibake. Your source code is written in UTF-8, but it was interpreted as Windows codepage 1252 (i.e. the compiler's source character set was CP1252).

The wide string contains the UTF-8 bytes 0xC4 0x92, read as two Windows codepage 1252 characters and then converted to UCS-2. The easiest way out is to just use an escape instead:

wchar_t* T2 = L"\x112";

or

wchar_t* T2 = L"\u0112";
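To check that both escapes really produce a single code unit, something like the following minimal sketch could be used. The names T2_hex, T2_ucn and is_single_unit are made up for this illustration, and the sketch assumes wchar_t is at least 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Both spellings name the code point U+0112 using only ASCII source
   characters, so the encoding of the source file cannot affect them. */
static const wchar_t T2_hex[] = L"\x112";
static const wchar_t T2_ucn[] = L"\u0112";

/* Hypothetical helper: returns 1 when the wide string consists of exactly
   one code unit equal to `expected` (plus the terminator), else 0. */
static int is_single_unit(const wchar_t *s, wchar_t expected)
{
    assert(s != NULL);
    return wcslen(s) == 1 && s[0] == expected;
}
```

Calling is_single_unit(T2_hex, 0x0112) and is_single_unit(T2_ucn, 0x0112) should both return 1, whichever source character set the compiler assumes.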

The larger problem is that, to my knowledge, neither C nor C++ has a mechanism for specifying the source character set within the code itself. It is always a setting or option external to the code, so it does not travel with something you can easily copy-paste.

Answer 2

Your compiler is misinterpreting your source code file (which is saved as UTF-8) as Windows-1252 (commonly called ANSI). It does not interpret the byte sequence C4 92 as the one-character UTF-8 string "Ē", but as the two-character Windows-1252 string "Ä’". The Unicode code point of "Ä" is U+00C4, and the Unicode code point of "’" (right single quotation mark) is U+2019. This is exactly what you see in your wide character string.
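The misreading can be reproduced by hand. The sketch below is only an illustration of the mapping, not what the compiler literally runs. The names cp1252_high, cp1252_to_unicode and misdecode_as_cp1252 are made up, and passing the five unassigned Windows-1252 bytes through unchanged is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F. The bytes
   0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in the code page;
   this sketch passes them through unchanged (an assumption). */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Maps one Windows-1252 byte to its Unicode code point. */
static unsigned short cp1252_to_unicode(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_high[byte - 0x80];
    return byte; /* 0x00..0x7F and 0xA0..0xFF have the same value in Unicode */
}

/* Treats every input byte as one Windows-1252 character, which is the
   misinterpretation described above. Returns the number of units written. */
static size_t misdecode_as_cp1252(const unsigned char *bytes, size_t n,
                                  unsigned short *out)
{
    assert(bytes != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = cp1252_to_unicode(bytes[i]);
    return n;
}
```

Feeding it the two bytes 0xC4 0x92 yields 0x00C4 and 0x2019, the same values as in the array from the question.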

The 8-bit string works only because the misinterpretation does not matter there: the string is not converted during compilation. The compiler reads the string as Windows-1252 and emits the string as Windows-1252 (so it does not need to convert anything, and considers both to be "Ä’"). You interpret the source code and the data in the binary as UTF-8, so you consider both to be "Ē".
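The byte-for-byte pass-through of the narrow string can be made visible with a small hex dump. This is a minimal sketch: the bytes are written as escapes so that the example does not depend on how the file is saved, and hex_dump is a made-up helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The two bytes that a UTF-8 source file contains between the quotes,
   spelled as escapes so this sketch is independent of file encoding. */
static const char T1[] = "\xC4\x92";

/* Hypothetical helper: writes the bytes of `s` into `out` as lowercase,
   space-separated hex. Stops early if `out` would overflow. */
static void hex_dump(const char *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL && out_size > 0);
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; s[i] != '\0' && used + 4 <= out_size; ++i) {
        /* Each step writes at most 3 characters plus the terminator. */
        used += (size_t)snprintf(out + used, out_size - used, "%s%02x",
                                 i ? " " : "",
                                 (unsigned)(unsigned char)s[i]);
    }
}
```

With a 16-byte buffer, hex_dump(T1, buf, sizeof buf) leaves "c4 92" in buf. Whether those two bytes later display as "Ē" or as "Ä’" depends only on the encoding the reader of the bytes assumes.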

To have the compiler (MSVC) treat your source code as UTF-8, use the switch /utf-8.

BTW: The correct UTF-16 encoding (which is the encoding MSVC uses for wide character strings) to be observed in the wide-character string is not {0xc492, 0x0000}, but {0x0112, 0x0000}, because "Ē" is U+0112.
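The relation between the UTF-8 bytes C4 92 and the code point U+0112 can be worked out by hand: the lead byte carries the top five payload bits and the continuation byte the low six. A minimal sketch that handles only the two-byte form follows. The function name decode_utf8_2 is made up for this illustration.

```c
#include <assert.h>

/* Decodes a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its code
   point. Sketch only: it does not handle one-, three- or four-byte
   sequences and does not reject overlong encodings. */
static unsigned decode_utf8_2(unsigned char lead, unsigned char cont)
{
    assert((lead & 0xE0) == 0xC0); /* lead byte must look like 110xxxxx */
    assert((cont & 0xC0) == 0x80); /* continuation must look like 10xxxxxx */
    return ((unsigned)(lead & 0x1F) << 6) | (unsigned)(cont & 0x3F);
}
```

For the bytes in question, 0xC4 & 0x1F is 0x04 and 0x92 & 0x3F is 0x12, so decode_utf8_2(0xC4, 0x92) gives (0x04 << 6) | 0x12 = 0x0112. The value 0xc492 is just the two bytes written side by side, not a code point.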


Answer 1: score 6, 2019-04-22 13:38:14

Answer 2: score 4, accepted, 2019-04-22 13:43:24
