
Converting C++ std::wstring to utf8 with std::codecvt_xxx

C++11 has tools to convert wide-character strings (std::wstring) to and from their UTF-8 representation: std::codecvt, std::codecvt_utf8, std::codecvt_utf8_utf16, etc.

Which one can a Windows app use to convert regular wide-character Windows strings (std::wstring) to a UTF-8 std::string? Does it always work without configuring locales?

It depends on how you convert them.
You need to specify the source encoding and the target encoding.
wstring is not a format; it only defines a data type.

Usually, when people say "Unicode" they mean UTF-16, which is what Microsoft Windows uses, and that is usually what a wstring contains.

So, the right way to convert from UTF-8 to UTF-16:

     std::string utf8String = "blah blah";

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::wstring utf16String = convert.from_bytes( utf8String );

And the other way around:

     std::wstring utf16String = L"blah blah";

     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
     std::string utf8String = convert.to_bytes( utf16String );
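
The two conversions above can be wrapped in small helpers that also cope with bad input. This is a minimal sketch rather than a definitive implementation: the helper names `utf8_to_wide` and `wide_to_utf8`, and the choice to report failure through a `bool`, are assumptions made for illustration. It relies on std::wstring_convert throwing std::range_error when a conversion fails and no error string was supplied to its constructor.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units in a std::wstring.
// Returns false instead of throwing when the input is not valid UTF-8.
bool utf8_to_wide(const std::string& utf8, std::wstring& out)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    try {
        out = convert.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}

// Hypothetical helper: UTF-16 code units in a std::wstring -> UTF-8 bytes.
// Returns false instead of throwing when the input cannot be converted.
bool wide_to_utf8(const std::wstring& wide, std::string& out)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    try {
        out = convert.to_bytes(wide);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

For example, utf8_to_wide("blah blah", w) should fill w with L"blah blah", while a stray byte such as "\xFF" should make it return false. Note that std::wstring_convert and the codecvt_utf8 family are deprecated since C++17, so newer compilers may emit warnings.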

To add to the confusion:
When you use std::string on a Windows platform (for example, in a multibyte build), it is NOT UTF-8. It is ANSI.
More specifically, it uses the default code page that your Windows installation is configured with.

Also, note that wstring is not exactly the same as UTF-16.
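
One way to see the difference: wchar_t is 16 bits wide on Windows but typically 32 bits on Linux and macOS, so only on Windows does a std::wstring naturally hold UTF-16 code units. Even in UTF-16, one element is not always one character, because code points above U+FFFF take two code units (a surrogate pair). The sketch below illustrates this with std::u16string, which holds UTF-16 on every platform. The names `wchar_is_16_bit` and `count_code_points` are made up for the example, and the function assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// True when wchar_t occupies 2 bytes, as it does on Windows.
constexpr bool wchar_is_16_bit = sizeof(wchar_t) == 2;

// Hypothetical helper: count code points in a well-formed UTF-16 string
// by skipping the second (low) half of every surrogate pair.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t unit : s) {
        const bool is_low_surrogate = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!is_low_surrogate) {
            ++n;
        }
    }
    return n;
}
```

For U+1F600, std::u16string(u"\U0001F600").size() is 2, while count_code_points reports 1.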

The Windows API functions expect these formats, depending on their suffix:

Functions ending in A - multibyte - ANSI
Functions ending in W - Unicode - UTF-16

It seems that std::codecvt_utf8 works well for the conversion std::wstring -> utf8. It passed all my tests. (Windows app, Visual Studio 2015, Windows 8 with EN locale)

I needed a way to convert filenames to UTF-8, so my test is about filenames.

In my app I use boost::filesystem::path 1.60.0 to deal with file paths. It works well, but it is not able to convert filenames to UTF-8 properly. Internally, the Windows version of boost::filesystem::path uses std::wstring to store the file path. Unfortunately, the built-in conversion to std::string works badly.

Test case:

  • create a file with mixed symbols, c:\\test\\皀皁皂皃的 (some random Asian symbols)
  • scan the directory with boost::filesystem::directory_iterator and get the boost::filesystem::path for the file
  • convert it to std::string via the built-in conversion filenamePath.string()
  • you get c:\\test\\?????. The Asian symbols are converted to '?'. Not good.

boost::filesystem uses std::codecvt internally. It doesn't work for the conversion std::wstring -> std::string.

Instead of the built-in boost::filesystem::path conversion, you can define a conversion function like this (original snippet):

#include <codecvt>
#include <locale>
#include <string>

// Convert a wide string to its UTF-8 representation.
std::string wstring_to_utf8(const std::wstring & str)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(str);
}

Then you can convert the file path to UTF-8 easily: wstring_to_utf8(filenamePath.wstring()). It works perfectly.

It worked for every file path I tried. I tested ASCII strings (c:\\test\\test_file), Asian strings (c:\\test\\皀皁皂皃的), Russian strings (c:\\test\\абвгд), and mixed strings (c:\\test\\test_皀皁皂皃的, c:\\test\\test_абвгд, c:\\test\\test_皀皁皂皃的_абвгд). For every string I received a valid UTF-8 representation.
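
To check results like these without depending on how a console renders text, one option is to dump the UTF-8 bytes as hex and compare them with the expected encoding. The sketch below does that using the same std::codecvt_utf8<wchar_t> conversion. The function name `utf8_hex` is hypothetical, and sample characters are best written as \u escapes so that the encoding of the source file itself does not matter.

```cpp
#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

// Hypothetical helper: convert a wide string to UTF-8 and return the bytes
// as lowercase hex pairs separated by single spaces.
std::string utf8_hex(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    const std::string bytes = conv.to_bytes(wide);

    std::string hex;
    char buf[3];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        if (!hex.empty()) {
            hex += ' ';
        }
        hex += buf;
    }
    return hex;
}
```

For instance, utf8_hex(L"\u0430") yields "d0 b0" for the Cyrillic а (U+0430), and utf8_hex(L"\u7680") yields "e7 9a 80" for 皀 (U+7680).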
