
How to use C++11 locale facilities to use UTF-8 as internal representation of strings?

I'm writing a portable library that deals with files and directories. I want to use UTF-8 for my input (directory paths) and output (file paths). The problem is that Windows gives me a choice between UTF-16-that-used-to-be-UCS-2 and codepages. So I have to convert all my UTF-8 strings to UTF-16, pass them to WinAPI, and convert the results back to UTF-8. C++11 seems to provide the <locale> library just for that, except that, from what I understand, none of the predefined specializations uses UTF-8 as the internal (i.e. my-side) encoding. The closest one is UTF-16-to-UTF-8, which is the exact opposite of what I want. So here's the first question:

1) How do I use the codecvt thingamajigs to convert my UTF-8 strings to UTF-16 for WinAPI calls, and the UTF-16 results back to UTF-8?

Another problem: I'm also targeting Linux. Linux has very good support for many different locales, and I don't want my library to be any different. Hopefully everyone will use UTF-8 on their Linux machines, but there is no strict guarantee of that. So I thought it would be a good idea to extend the above Windows-specific behavior and always convert between UTF-8 and the system locale's encoding. Except that I don't see any way in C++11's <locale> library to get the current system encoding! The default std::locale constructor returns whatever global locale I set myself, and if I haven't set one, it returns the classic "C" locale. And there are no other getters I'm aware of. So here's the second question:

2) How do I detect the current system locale? Is there something in <locale>? Maybe some standard C library function, or (less portable, but okay in this case) something in the POSIX API?

The design of these facilities in the standard library assumes that multibyte character encodings (like UTF-8) are used only for external storage (i.e. byte sequences in files on disk) and that all characters in memory are uniform in size. This is so that things like std::basic_string<T>::operator[] can behave in a manner consistent with the performance constraints imposed by the standard. So while you can use files encoded in UTF-8 or some other MBCS (like those for Japanese), your strings in memory should use char, char16_t, char32_t or wchar_t.
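To make the point about operator[] concrete, here is a minimal sketch of what happens when UTF-8 is stored in a std::string anyway: indexing and size() work on bytes, not characters. The helper utf8_code_points and the variable sample are made up for this illustration, and the helper assumes its input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 byte string
// by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; it does no validation.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return n;
}

// "héllo" spelled with explicit bytes, so the source file's own
// encoding doesn't matter. U+00E9 is the two bytes C3 A9 in UTF-8.
const std::string sample = "h\xC3\xA9llo";
```

With this, sample.size() reports 6 (bytes) while utf8_code_points(sample) reports 5, and sample[1] is only the first half of the é. That mismatch is what the constant-time indexing requirement runs into.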

This is why you aren't finding a match in the standard library for what you want to do: strings in memory aren't intended to be stored in UTF-8. Other languages work similarly. In Java, for example, data on disk is interpreted as a stream of bytes, and to turn it into strings you need to tell some component the expected character encoding of the byte stream. Some operating systems may stuff a UTF-8 string into argv[], but this is non-standard. This is the reason that the Unicode-enabled entry point for WinMain on Windows provides a NUL-terminated pointer to wchar_t and not a char* pointing to a UTF-8 encoded string.

IBM's International Components for Unicode (ICU) library provides a whole set of components that are complementary to, and designed to work with, the C++ standard library. I would look at their code conversion facilities. While the standard defines facilities in <locale> for code conversion, it doesn't guarantee a code conversion facility that maps from UTF-8 to wchar_t: the wchar_t facet converts to an implementation-defined native encoding, so what you get depends on the details of your implementation. (C++11 does specify codecvt specializations between UTF-8 and char16_t or char32_t, which is the UTF-16-to-UTF-8 facet you found.) The ICU library provides this functionality portably for any C++ implementation. It is well supported and well used, and unlikely to have bugs decoding UTF-8 strings into the appropriate wider-than-char string.
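On the first question specifically: the UTF-16-to-UTF-8 facet the question mentions can be driven in both directions through std::wstring_convert from C++11's <codecvt> header, so it is less "opposite" than it looks. The following is only a sketch under some assumptions: the toolchain still ships these classes (they are deprecated since C++17, so expect warnings), and the helper names utf8_to_utf16 and utf16_to_utf8 are invented here.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper names. Only std::wstring_convert and
// std::codecvt_utf8_utf16 come from the standard library.

// UTF-8 bytes -> UTF-16 code units.
// from_bytes throws std::range_error on malformed input.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// UTF-16 code units -> UTF-8 bytes.
// to_bytes throws std::range_error on malformed input.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

For WinAPI calls the same pattern should apply with wchar_t in place of char16_t, since wchar_t is 16 bits wide on Windows; that gives a std::wstring whose c_str() can be handed to the wide-character functions. Given the deprecation, ICU remains the more durable choice for anything long-lived.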

Konrad mentioned the UTF-8 Everywhere Manifesto in a comment. It was an interesting read, and it points you to the Boost.Nowide library (not officially a part of Boost yet) for solutions to the problems you cite above.

Please note that my answer is simply a description of the way the existing C++ standard library classes like std::basic_string<T> work. It is not advice against UTF-8, Unicode, or anything else. The manifesto cited agrees with me that these things simply don't work this way, and if you want to use UTF-8 everywhere, then you need something else.
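As for the second question, one possible approach on POSIX systems (the "less portable but okay" route) is to adopt the environment's locale with setlocale and then ask nl_langinfo for its codeset. This is a rough sketch rather than a recommendation: system_codeset is a made-up name, the header <langinfo.h> is POSIX and not standard C++, and the function changes the process-wide LC_CTYPE locale as a side effect.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>  // POSIX only, not part of standard C++

// Hypothetical helper: returns the name of the character encoding of the
// locale in effect after adopting the environment's settings
// (LC_ALL / LC_CTYPE / LANG).
std::string system_codeset() {
    // A program starts in the "C" locale; "" asks for the environment's
    // locale instead. If that locale is unavailable, setlocale returns
    // a null pointer and the current locale stays in effect, so the
    // codeset reported below is then the one of the unchanged locale.
    std::setlocale(LC_CTYPE, "");

    const char* codeset = nl_langinfo(CODESET);
    return codeset ? std::string(codeset) : std::string();
}
```

On a typical modern Linux desktop this would be expected to yield a name such as UTF-8, but the exact spelling of codeset names varies between systems, so compare it loosely before deciding whether a conversion is needed.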

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish this post, please cite this site's URL or the original address.