
How to use C++11 locale facilities to use UTF-8 as internal representation of strings?

Original: 2014-07-13 12:24:58 9 1 c++/ c++11/ encoding/ utf-8/ locale

I'm writing a portable library that deals with files and directories. I want to use UTF-8 for my input (directory paths) and output (file paths). The problem is, Windows gives me a choice between UTF-16-that-used-to-be-UCS-2, and codepages. So I have to convert all my UTF-8 strings to UTF-16, pass them to WinAPI, and convert the results back to UTF-8. C++11 seems to provide the <locale> library just for that, except that, from what I understood, none of the predefined specializations uses UTF-8 as the internal (i.e. my-side) encoding - the closest there is is UTF-16-to-UTF-8, which is the exact opposite of what I want. So here's the first question:

1) How to use codecvt thingamajigs to convert my UTF-8 strings to UTF-16 for WinAPI calls, and the UTF-16 results back to UTF-8?

Another problem: I'm also targeting Linux. On Linux, there is very good support for many different locales - and I don't want to be any different. Hopefully everyone will use UTF-8 on their Linux machines, but there is no strict guarantee of that. So I thought it would be a good idea to extend the above Windows-specific behavior and always convert between UTF-8 and the system locale's encoding. Except that I don't see any way in C++11's <locale> library to get the current system encoding! The default std::locale constructor returns whatever locale I have set myself, and if I haven't set one, it returns the classic "C" locale. And there are no other getters I'm aware of. So here's the second question:

2) How to detect the current system locale? Something in <locale>? Maybe some standard C library function, or (less portable but okay in this case) something in the POSIX API?

1 Answer

The design of these facilities in the standard library assumes that multibyte character encodings (like UTF-8) are used only for external storage (i.e. byte sequences in files on disk) and that all characters in memory are uniform in size. This is so things like std::basic_string<T>::operator[] can behave in a manner consistent with the performance constraints imposed by the standard. So while you can use files encoded in UTF-8 or some other MBCS (like those for Japanese), your strings in memory should be char, char16_t, char32_t or wchar_t.
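To make the uniform-size point concrete: a std::string that holds UTF-8 indexes and counts bytes, not characters. The sketch below uses a hypothetical helper, utf8_code_points (my own name, not part of the standard library), and assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a well-formed
// UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx).
// It does no validation, so malformed input gives a meaningless count.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a code point
        }
    }
    return count;
}
```

For the string "na\xC3\xAFve" ("naïve" in UTF-8), size() reports 6 bytes while the helper reports 5 code points, and operator[] at index 2 yields only the first byte of the two-byte "ï".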

This is why you aren't finding a match in the standard library for what you want to do: strings in memory aren't intended to be stored in UTF-8. This is similar to other languages as well, such as Java, where data on disk is interpreted as a stream of bytes, and to turn it into strings you need to tell some component the expected character encoding of the byte stream. Some operating systems may stuff a UTF-8 string into argv[], but this is non-standard. This is the reason that the Unicode-enabled entry point for WinMain on Windows provides a pointer to a NUL-terminated wchar_t string and not a char* pointing to a UTF-8 encoded string.

IBM's International Components for Unicode (ICU) library provides a whole set of components that are complementary to, and designed to work with, the C++ standard library. I would look at their code conversion facilities. While the standard defines facilities in <locale> for code conversion, it doesn't guarantee any existence of a code conversion facility to map from UTF-8 to char16_t, char32_t, or wchar_t. If such a thing exists, you'll only get it based on the details of your implementation. The ICU library provides this functionality portably for any C++ implementation. It is well supported and well used and unlikely to have bugs decoding UTF-8 strings into the appropriate wider-than-char string.
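For the plain UTF-8/UTF-16 round trip in question 1, C++11's <codecvt> header does offer std::wstring_convert together with std::codecvt_utf8_utf16. Both were deprecated in C++17, so treat the following as a minimal sketch for toolchains that still ship them rather than a long-term answer. The function names utf8_to_utf16 and utf16_to_utf8 are my own, chosen for illustration.

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert   (deprecated since C++17)
#include <string>

// Illustrative helper: UTF-8 bytes -> UTF-16 code units.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Illustrative helper: UTF-16 code units -> UTF-8 bytes.
// to_bytes throws std::range_error on invalid UTF-16 (e.g. a lone surrogate).
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

On Windows, wchar_t is 16 bits wide, so the same pattern with std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> produces a std::wstring whose c_str() can be handed to the wide WinAPI functions. On platforms where wchar_t is 32 bits, that wchar_t variant is not the right tool.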

Konrad mentioned the UTF-8 Everywhere Manifesto in a comment. It was an interesting read, and it points you to the Boost.Nowide library (not officially a part of Boost yet) for solutions to the problems you cite above.

Please note that my answer is simply a description of the way the existing C++ standard library classes like std::basic_string<T> work. It is not advice against UTF-8, Unicode, or anything else. The manifesto cited agrees with me that these things simply don't work this way, and if you want to use UTF-8 anywhere, then you need something else.
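On question 2 (detecting the system encoding), one approach on POSIX systems is to adopt the environment's locale with setlocale and then ask nl_langinfo for the codeset name. This is only a sketch under that POSIX assumption: <langinfo.h> does not exist on Windows, and the helper name system_codeset is my own.

```cpp
#include <clocale>     // std::setlocale, LC_CTYPE
#include <langinfo.h>  // nl_langinfo, CODESET (POSIX only, not standard C++)
#include <string>

// Illustrative helper: returns the character encoding name of the
// locale selected by the environment (LANG, LC_ALL, LC_CTYPE).
// Note the side effect: it changes the process-wide LC_CTYPE category.
std::string system_codeset() {
    // An empty name selects the user's preferred locale from the
    // environment. If the call fails, the current locale is left as-is.
    std::setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

On a typical UTF-8 Linux setup this returns a name such as "UTF-8", but the exact spelling is up to the C library. Within <locale> itself, std::locale("") constructs the user-preferred locale (as opposed to the default constructor, which copies the global one), though it exposes only an implementation-defined name() and may throw std::runtime_error if the environment names a locale that is not installed.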