
How do I properly use std::string on UTF-8 in C++?

My platform is a Mac. I'm a C++ beginner working on a personal project that processes Chinese and English. UTF-8 is the preferred encoding for this project.

I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.

However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.

Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for handling the above problems?

Unicode Glossary

Unicode is a vast and complex topic. I do not wish to wade too deep there; however, a quick glossary is necessary:

  1. Code Points: Code Points are the basic building blocks of Unicode; a code point is just an integer mapped to a meaning. The integer portion fits into 32 bits (well, 21 bits really, since the largest code point is U+10FFFF), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
  2. Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can mostly be glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.


UTF Primer

Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.

In UTF-X, X is the size in bits of the Code Unit; each Code Point is represented as one or several Code Units, depending on its magnitude:

  • UTF-8: 1 to 4 Code Units,
  • UTF-16: 1 or 2 Code Units,
  • UTF-32: 1 Code Unit.
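
As a rough illustration of the list above, the sketch below stores the same code points in each encoding and counts Code Units with size(). The UTF-8 bytes are spelled out as escapes so the example does not depend on the source file's encoding; the function name code_units_demo is made up for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical demo: the same code point needs a different number of
// code units depending on the encoding.
void code_units_demo() {
    // U+54C8 (哈), a typical Chinese character.
    std::string    ha8  = "\xE5\x93\x88";  // UTF-8 bytes written out by hand
    std::u16string ha16 = u"\u54C8";
    std::u32string ha32 = U"\u54C8";
    assert(ha8.size() == 3);   // 3 code units of 8 bits
    assert(ha16.size() == 1);  // 1 code unit of 16 bits
    assert(ha32.size() == 1);  // 1 code unit of 32 bits

    // U+1F600 (an emoji), outside the Basic Multilingual Plane.
    std::string    grin8  = "\xF0\x9F\x98\x80";
    std::u16string grin16 = u"\U0001F600";  // stored as a surrogate pair
    std::u32string grin32 = U"\U0001F600";
    assert(grin8.size() == 4);
    assert(grin16.size() == 2);
    assert(grin32.size() == 1);
}
```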

std::string and std::wstring.

  1. Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
  2. The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
  3. While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.

If you are only reading or composing strings, you should have little to no issue with std::string or std::wstring.

Troubles start when you start slicing and dicing; then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.
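
As a sketch of what handling Code Point boundaries by hand might look like in UTF-8, the helper below truncates a string to a byte budget without cutting a Code Point in half. It relies on continuation bytes having the bit pattern 10xxxxxx. The names is_continuation and truncate_utf8 are invented for this example, and the input is assumed to be valid UTF-8. Grapheme Clusters are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: keep at most max_bytes bytes of s, backing up so
// that the cut never lands in the middle of a code point.
// Assumes s is valid UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t cut = max_bytes;
    while (cut > 0 && is_continuation(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return s.substr(0, cut);
}

void truncate_demo() {
    const std::string s = "a\xE5\x93\x88" "b";  // "a哈b", 5 bytes
    assert(truncate_utf8(s, 4) == "a\xE5\x93\x88");
    assert(truncate_utf8(s, 3) == "a");  // a plain substr(0, 3) would split 哈
}
```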


Picking std::string or std::u32string?

If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint, though heavy use of Chinese may change the deal (most Chinese characters take 3 bytes in UTF-8 against 4 in UTF-32). As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
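
To make the "1 Code Unit -> 1 Code Point" property concrete, here is a minimal sketch of a UTF-8 to std::u32string conversion followed by plain indexing. The function decode_utf8 is a hypothetical helper written for this example, not a standard facility. It assumes valid input and skips the validation a real decoder needs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into one char32_t per code point.
// Assumes the input is valid UTF-8; a real decoder must also reject
// truncated sequences, overlong forms and surrogates.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = lead;                                    // 0xxxxxxx (ASCII)
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }  // 11110xxx
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

void decode_demo() {
    // 'a', U+54C8 (哈) and U+1F600 as 8 bytes of UTF-8.
    const std::u32string text = decode_utf8("a\xE5\x93\x88\xF0\x9F\x98\x80");
    assert(text.size() == 3);          // 3 code points
    assert(text[1] == U'\u54C8');      // indexing yields whole code points
    assert(text[2] == U'\U0001F600');
}
```

The cost is the conversion itself plus 4 bytes per Code Point, which is why the choice depends on how much slicing the program actually does.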

If you interface with software taking std::string or char* / char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.


UTF-8 in std::string.

UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:

  • str.find('\n') works,
  • str.find("...") works for matching byte by byte (see note 1),
  • str.find_first_of("\r\n") works if searching for ASCII characters.
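
The sketch below tries these three calls on a small UTF-8 string and shows why the find_first_of case is limited to ASCII. The returned positions are byte offsets, not character counts. The function name find_demo is made up for this example, and the Chinese characters are written as byte escapes.

```cpp
#include <cassert>
#include <string>

void find_demo() {
    const std::string ha  = "\xE5\x93\x88";       // 哈 (U+54C8)
    const std::string hao = "\xE5\xA5\xBD";       // 好 (U+597D)
    const std::string text = hao + ha + "\nabc";  // "好哈\nabc"

    assert(text.find('\n') == 6);             // results are byte offsets
    assert(text.find(ha) == 3);               // whole code point, matched byte by byte
    assert(text.find_first_of("\r\n") == 6);  // ASCII set: safe

    // With non-ASCII characters in the set, find_first_of matches single
    // bytes: 好 and 哈 share the lead byte 0xE5, so this "finds" 哈 at 0.
    assert(text.find_first_of(ha) == 0);
}
```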

Similarly, regex should mostly work out of the box. As a sequence of characters ("哈哈") is just a sequence of bytes ("\xe5\x93\x88\xe5\x93\x88"), basic search patterns should work out of the box.

Be wary, however, of character classes (such as [:alnum:]), as depending on the regex flavor and implementation they may or may not match Unicode characters.

Similarly, be wary of applying repeaters to non-ASCII "characters": "哈?" may only consider the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
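
The repeater pitfall can be reproduced with std::regex over plain char strings, where every byte of a UTF-8 sequence is a separate pattern character. This is a small sketch under that assumption; regex_demo is a name invented for this example, and other regex libraries with Unicode support may behave differently.

```cpp
#include <cassert>
#include <regex>
#include <string>

void regex_demo() {
    const std::string ha = "\xE5\x93\x88";  // 哈 (U+54C8), 3 bytes in UTF-8

    // Literal multi-byte text in a pattern is matched byte by byte.
    assert(std::regex_search("abc" + ha + ha, std::regex(ha + ha)));

    // "哈?" makes only the last byte optional, so it accepts a
    // truncated (invalid) 2-byte sequence.
    const std::regex last_byte_optional(ha + "?");
    assert(std::regex_match(ha.substr(0, 2), last_byte_optional));

    // "(哈)?" makes the whole code point optional.
    const std::regex code_point_optional("(" + ha + ")?");
    assert(std::regex_match(std::string(), code_point_optional));
    assert(std::regex_match(ha, code_point_optional));
    assert(!std::regex_match(ha.substr(0, 2), code_point_optional));
}
```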

Note 1: The key concepts to look up are normalization and collation; this affects all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
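
A short sketch of why normalization matters for byte-by-byte comparison: the two strings below both display as "é", but one uses the precomposed Code Point and the other a base letter plus a combining accent. The function name normalization_demo is made up for this example.

```cpp
#include <cassert>
#include <string>

void normalization_demo() {
    const std::string precomposed = "\xC3\xA9";   // U+00E9, é as one code point
    const std::string decomposed  = "e\xCC\x81";  // U+0065 + U+0301 (combining acute)

    // Both render as "é", yet std::string compares raw bytes.
    assert(precomposed != decomposed);
    assert(precomposed.size() == 2 && decomposed.size() == 3);

    // Sorting is by byte value too: 'Z' (0x5A) sorts before 'a' (0x61),
    // and any non-ASCII character sorts after all of ASCII.
    assert(std::string("Z") < std::string("a"));
    assert(std::string("z") < precomposed);
}
```

Making the two forms compare equal requires normalizing both to the same form first, which the standard library does not offer.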

std::string and friends are encoding-agnostic. The only difference between std::wstring and std::string is that std::wstring uses wchar_t as the individual element, not char. For most compilers the latter is 8-bit. The former is supposed to be large enough to hold any Unicode character, but in practice on some systems it isn't (Microsoft's compiler, for example, uses a 16-bit type). You can't store UTF-8 in std::wstring; that's not what it's designed for. It's designed to be an equivalent of UTF-32: a string where each element is a single Unicode codepoint.

If you want to index UTF-8 strings by Unicode codepoint or composed Unicode glyph (or some other thing), count the length of a UTF-8 string in Unicode codepoints or some other Unicode object, or find by Unicode codepoint, you're going to need to use something other than the standard library. ICU is one of the libraries in the field; there may be others.

Something that's probably worth noting is that if you're searching for ASCII characters, you can mostly treat a UTF-8 bytestream as if it were byte-by-byte. Each ASCII character encodes the same in UTF-8 as it does in ASCII, and every multi-byte unit in UTF-8 is guaranteed not to include any bytes in the ASCII range.
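
For instance, splitting UTF-8 text on an ASCII delimiter can be done with ordinary byte-level searching. This is a minimal sketch; split_on_ascii is a hypothetical helper written for this example, and it is only safe when the delimiter is an ASCII character.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text on an ASCII delimiter.
// Safe because bytes below 0x80 never occur inside a multi-byte sequence.
std::vector<std::string> split_on_ascii(const std::string& text, char delimiter) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = text.find(delimiter, start);
        if (end == std::string::npos) {
            fields.push_back(text.substr(start));  // last field
            return fields;
        }
        fields.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

void split_demo() {
    // "哈,好,ok" with the Chinese characters written as UTF-8 bytes.
    const auto fields = split_on_ascii("\xE5\x93\x88,\xE5\xA5\xBD,ok", ',');
    assert(fields.size() == 3);
    assert(fields[0] == "\xE5\x93\x88");
    assert(fields[1] == "\xE5\xA5\xBD");
    assert(fields[2] == "ok");
}
```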

Both std::string and std::wstring must use a UTF encoding to represent Unicode. On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.

For both, size tracks the number of code units instead of the number of code points, or grapheme clusters. (A code point is one named Unicode entity, one or more of which form a grapheme cluster. Grapheme clusters are the visible characters that users interact with, like letters or emojis.)
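
To see the gap between size and the number of code points, here is a small sketch that counts code points in a UTF-8 std::string by skipping continuation bytes. The helper count_code_points is invented for this example and assumes valid UTF-8; counting grapheme clusters would still need a Unicode library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in valid UTF-8 text.
// Every code point has exactly one non-continuation byte, so count those.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;  // skip 10xxxxxx bytes
    }
    return count;
}

void size_demo() {
    const std::string text = "ok \xE5\x93\x88\xF0\x9F\x98\x80";  // "ok 哈" + U+1F600
    assert(text.size() == 10);             // code units (bytes)
    assert(count_code_points(text) == 5);  // 'o', 'k', ' ', 哈, U+1F600
}
```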

Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of grapheme clusters. Obviously, however, this comes at the cost of using up to 4x more memory.

The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.

Finally, UTF strings in human languages that don't use combining characters usually do pretty well with find / regex. I'm not sure about Chinese, but English is one of them.

Consider upgrading to C++20 and std::u8string, which is the best thing we have as of 2019 for holding UTF-8. There are no standard library facilities to access individual code points or grapheme clusters, but at least your type is strong enough to say it is true UTF-8.

Should I go ahead with std::string or switch to std::wstring?

I would recommend using std::string because wchar_t is non-portable and C++20 char8_t is poorly supported in the standard and not supported by any system APIs at all (and will likely never be, for compatibility reasons). On most platforms, including the macOS you are using, normal char strings are already UTF-8.

Most of the standard string operations work with UTF-8 but operate on code units. If you want a higher-level API you'll have to use something else, such as the text library proposed to Boost.


 