
std::codecvt_utf8_utf16 doesn't convert UTF-8 to UTF-16 in big-endian

I converted a UTF-8 encoded string to a UTF-16 string using wstring_convert and codecvt_utf8_utf16.

Here is the sample code I tested:

#include <codecvt>
#include <cstddef>
#include <cstdint>
#include <cstdio>   // printf
#include <locale>   // std::wstring_convert
#include <string>

std::u16string UTF8ToWide(const std::string& utf_str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(utf_str);
}

void DisplayBytes(const void* data, size_t len)
{
    const uint8_t* src = static_cast<const uint8_t*>(data);
    for (size_t i = 0; i < len; ++i) {
        printf("%.2x ", src[i]);
    }
}

// the content is:"你好 hello chinese test 中文测试"
std::string utf8_s = "\xe4\xbd\xa0\xe5\xa5\xbd hello chinese test \xe4\xb8\xad\xe6\x96\x87\xe6\xb5\x8b\xe8\xaf\x95";

int main()
{
    auto ss = UTF8ToWide(utf8_s);
    DisplayBytes(ss.data(), ss.size() * sizeof(decltype(ss)::value_type));
    return 0;
}

According to the reference manual, the default std::codecvt_mode for the codecvt_utf8_utf16 facet is big-endian.

However, the test program displays the bytes as follows:

60 4f 7d 59 20 00 68 00 65 00 6c 00 6c 00 6f 00 20 00 63 00 68 00 69 00 6e 00 65 00 73 00 65 00 20 00 74 00 65 00 73 00 74 00 20 00 2d 4e 87 65 4b 6d d5 8b

which is little-endian.

I ran the test code with Visual Studio 2013 and with clang, and got the same result from both.

So why doesn't the big-endian mode of codecvt_utf8_utf16 have any effect on this conversion?

The same page you reference says the little_endian flag applies to input only. The output is a sequence of UTF-16 code units, not a byte stream. Each code unit is stored in whatever byte order is native to the platform, which in your case is little-endian.
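To see the distinction between code-unit values and their in-memory bytes, it may help to print each char16_t as a number rather than dumping raw memory. The sketch below assumes a helper named CodeUnitsToHex, which is hypothetical and not part of the question's code. Because it formats the numeric value of each code unit, its result should be the same on little-endian and big-endian machines.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the original post): formats each UTF-16
// code unit as a 4-digit hex number followed by a space. It reads the
// numeric value of each char16_t, never its bytes in memory, so the
// result does not depend on the platform's byte order.
std::string CodeUnitsToHex(const std::u16string& s)
{
    std::string out;
    char buf[8];  // enough for "ffff " plus the terminating NUL
    for (char16_t unit : s) {
        std::snprintf(buf, sizeof buf, "%04x ", static_cast<unsigned>(unit));
        out += buf;
    }
    return out;
}
```

For the first two characters of the test string, 你 (U+4F60) and 好 (U+597D), this yields "4f60 597d ", whereas the byte dump in the question shows the same two values split into bytes in little-endian order.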

Your program is just showing you how a char16_t is represented in memory.
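If what you actually need is a big-endian UTF-16 byte stream (for a file or a network protocol, say), one option is to serialize the code units yourself. This is a minimal sketch under that assumption. ToUtf16BE is a hypothetical helper name, and it is not something the facet does for you.

```cpp
#include <string>

// Hypothetical helper (not from the original post): serializes UTF-16 code
// units into a byte string, high byte first, regardless of how the
// platform stores char16_t in memory.
std::string ToUtf16BE(const std::u16string& s)
{
    std::string bytes;
    bytes.reserve(s.size() * 2);
    for (char16_t unit : s) {
        bytes.push_back(static_cast<char>((unit >> 8) & 0xff));  // high byte
        bytes.push_back(static_cast<char>(unit & 0xff));         // low byte
    }
    return bytes;
}
```

Passing the result of ToUtf16BE(ss) to DisplayBytes (with the byte string's data() and size()) should then print a dump that starts with 4f 60 59 7d 00 20 instead of 60 4f 7d 59 20 00.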
