
Issue when converting utf16 wide std::wstring to utf8 narrow std::string for rare characters

Why do some UTF-16 encoded wide strings, when converted to UTF-8 encoded narrow strings with this commonly found conversion function, produce hex values that don't appear to be correct?

#include <codecvt>
#include <locale>
#include <string>

std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(str);
}

Hello. I have a C++ app on Windows that takes some user input on the command line. I'm using the wide-character main entry point to get the input as a UTF-16 string, which I convert to a UTF-8 narrow string using the function above.

This function can be found in many places online and works in almost all cases. I have, however, found a few examples where it doesn't seem to work as expected.

For example, if I put the emoji character "🤢" in a string literal (in my UTF-8 encoded cpp file) and write it to disk, the file (FILE-1) contains the following data (the correct UTF-8 hex values, as specified at https://www.fileformat.info/info/unicode/char/1f922/index.htm):

    0xF0 0x9F 0xA4 0xA2

However, if I pass the emoji to my application on the command line, convert it to a UTF-8 string using the conversion function above, and then write it to disk, the file (FILE-2) contains different raw bytes:

    0xED 0xA0 0xBE 0xED 0xB4 0xA2

While the second file seems to indicate that the conversion has produced the wrong output, copying and pasting the hex values (in Notepad++ at least) produces the correct emoji. WinMerge also considers the two files to be identical.

So, to conclude, I would really like to know the following:

  1. How the incorrect-looking converted hex values map to the right UTF-8 character in the example above.
  2. Why the conversion function converts some characters to this form while almost all other characters produce the expected raw bytes.
  3. As a bonus, whether it is possible to modify the conversion function to stop it from outputting these rare characters in this form.

I should note that I already have a workaround function (below) which uses WinAPI calls; however, using standard library calls only is the dream :)

std::string convert_string(const std::wstring& wstr)
{
    if(wstr.empty())
        return std::string();

    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

The problem is that std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in UCS-2 and UTF-16, so they work. Characters outside the BMP (U+10000..U+10FFFF), such as your emoji, do not exist in UCS-2 at all. The conversion therefore doesn't understand the character and produces incorrect UTF-8 bytes: technically, it converts each half of the UTF-16 surrogate pair into a separate three-byte UTF-8 sequence.

You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
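As a minimal sketch of that change, the function from the question only needs a different facet. Two caveats that go beyond the original answer: the `<codecvt>` facets and std::wstring_convert are deprecated since C++17, so compilers may warn, and the behavior on malformed input (for example an unpaired surrogate) is left to the library.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Same shape as the function in the question, but the facet now treats
// the wide string as UTF-16 code units, so a surrogate pair is combined
// into one code point before it is encoded as UTF-8.
// Note: <codecvt> and std::wstring_convert are deprecated since C++17.
std::string convert_string(const std::wstring& str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(str);
}
```

With this version, the two code units 0xD83E 0xDD22 should come out as the four bytes 0xF0 0x9F 0xA4 0xA2.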

There is already an accepted answer here, but for the record, here is some additional information.

The nauseated face emoji was added to Unicode in 2016. It is encoded as 4 UTF-8 bytes (0xF0 0x9F 0xA4 0xA2) or 2 UTF-16 code units (0xD83E 0xDD22).

The surprising encoding 0xED 0xA0 0xBE 0xED 0xB4 0xA2 corresponds in fact to a surrogate pair whose two halves were each encoded to UTF-8 separately:

  • 0xED 0xA0 0xBE is the UTF-8 encoding of the high surrogate 0xD83E according to this conversion table.
  • 0xED 0xB4 0xA2 is the UTF-8 encoding of the low surrogate 0xDD22 according to this table.
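
The byte values in the two bullets above can be checked by hand. The following is a rough sketch of the arithmetic only; the helper names (encode_three_bytes, combine_surrogates, encode_four_bytes) are made up for illustration and are not part of any library.

```cpp
#include <cstdint>
#include <string>

// Encode one 16-bit value as a three-byte UTF-8 sequence, the way a
// UCS-2 converter would (meaningful for values 0x0800..0xFFFF).
std::string encode_three_bytes(std::uint16_t unit)
{
    std::string out;
    out += static_cast<char>(0xE0 | (unit >> 12));          // 1110xxxx
    out += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));  // 10xxxxxx
    out += static_cast<char>(0x80 | (unit & 0x3F));         // 10xxxxxx
    return out;
}

// Combine a UTF-16 surrogate pair into the code point it represents.
std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// Encode a supplementary code point (U+10000..U+10FFFF) as four bytes.
std::string encode_four_bytes(std::uint32_t cp)
{
    std::string out;
    out += static_cast<char>(0xF0 | (cp >> 18));            // 11110xxx
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));   // 10xxxxxx
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));    // 10xxxxxx
    out += static_cast<char>(0x80 | (cp & 0x3F));           // 10xxxxxx
    return out;
}
```

Here encode_three_bytes(0xD83E) gives 0xED 0xA0 0xBE and encode_three_bytes(0xDD22) gives 0xED 0xB4 0xA2, which is FILE-2. combine_surrogates(0xD83E, 0xDD22) gives 0x1F922, and encode_four_bytes(0x1F922) gives 0xF0 0x9F 0xA4 0xA2, which is FILE-1.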

So basically, your first encoding is direct UTF-8. The second is what you get when the UTF-16 representation of the desired character is treated as UCS-2 and each 16-bit code unit is encoded to UTF-8 on its own.

As the accepted answer rightly pointed out, std::codecvt_utf8<wchar_t> is the culprit, because it deals with UCS-2 and not UTF-16.
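
For completeness, since the `<codecvt>` facets are deprecated as of C++17, here is a rough sketch of a hand-written conversion that uses only the standard library. The names utf16_to_utf8 and append_utf8 are hypothetical, and replacing unpaired surrogates with U+FFFD is a design choice made for this sketch, not something taken from the answers above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: treats each wchar_t as one UTF-16 code unit.
std::string utf16_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool is_low  = cp >= 0xDC00 && cp <= 0xDFFF;
        if (is_high && i + 1 < in.size()) {
            const std::uint32_t next = static_cast<std::uint32_t>(in[i + 1]);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Valid pair: combine both halves into one code point.
                append_utf8(out, 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        // Unpaired surrogate or out-of-range value: emit U+FFFD instead.
        if (is_high || is_low || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The surrogate check happens before any bytes are written, which is the step the UCS-2 facet skips.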

It's quite astonishing to still find this obsolete encoding in standard libraries nowadays, but I suspect it is a remnant of Microsoft's lobbying in the standards committee, dating back to the old Windows support for Unicode with UCS-2.
