
C++11: Example of difference between ordinary string literal and UTF-8 string literal?

Original 2013-02-04 02:42:18 8 1 c++/ utf-8/ character-encoding/ c++11/ string-literals

A string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.

A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.

I don't understand the difference between an ordinary string literal and a UTF-8 string literal.

Can someone provide an example of a situation where they are different, that is, where they cause different compiler output?

(I mean from the point of view of the standard, not of any particular implementation.)

Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set.

1 answer

The C and C++ languages allow a huge amount of latitude in their implementations. C was written long before UTF-8 became the usual way to encode text as a sequence of bytes: different systems had different text encodings.

So the byte values of a string in C and C++ are really up to the compiler. 'A' is whatever the compiler's chosen encoding gives for the character A, which may not agree with UTF-8.

C++11 added the requirement that compilers support real UTF-8 string literals. The value of u8"A"[0] is fixed by the C++ standard through the UTF-8 standard (it is 0x41), regardless of the preferred encoding of the platform the compiler is targeting.
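As a minimal sketch of that guarantee, the snippet below compares the first code unit of u8"A" with that of "A". The function names are hypothetical and chosen only for illustration. The casts to unsigned char are there so the code compiles both where u8 literals are arrays of char (C++11 to C++17) and where they are arrays of char8_t (C++20).

```cpp
#include <cassert>

// First code unit of the UTF-8 literal. UTF-8 encodes U+0041 as the
// single byte 0x41, so this value does not depend on the platform.
constexpr unsigned char first_unit_of_u8_A() {
    return static_cast<unsigned char>(u8"A"[0]);
}

// First code unit of the ordinary literal. This is whatever the
// execution character set uses for 'A' (0x41 in ASCII-compatible
// sets, 0xC1 in EBCDIC, for example).
constexpr unsigned char first_unit_of_plain_A() {
    return static_cast<unsigned char>("A"[0]);
}

// Holds on every conforming implementation.
static_assert(first_unit_of_u8_A() == 0x41,
              "u8\"A\"[0] is the UTF-8 encoding of U+0041");

// Hypothetical runtime check: returns true when the ordinary literal
// happens to agree with UTF-8 for this character.
inline bool plain_A_matches_utf8() {
    assert(first_unit_of_u8_A() == 0x41);
    return first_unit_of_plain_A() == first_unit_of_u8_A();
}
```

Only the static_assert is a portable guarantee. Whether plain_A_matches_utf8() returns true depends on the execution character set the compiler uses.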

Now, much as most platforms C++ targets use 2's complement integers, most compilers use character encodings that are mostly compatible with UTF-8 (typically because they agree with ASCII). So "hello world" and u8"hello world" will almost certainly be identical.
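One way to check this on a given compiler is to compare the two literals code unit by code unit. The sketch below does that. same_code_units and check_hello_world are hypothetical helper names, and the assertion is only expected to pass where the execution character set is ASCII-compatible.

```cpp
#include <cassert>
#include <cstddef>

// Compares two string literals code unit by code unit, including the
// terminating NUL. The element types are template parameters so that a
// u8 literal works whether its elements are char (C++11 to C++17) or
// char8_t (C++20).
template <typename A, std::size_t N, typename B, std::size_t M>
bool same_code_units(const A (&a)[N], const B (&b)[M]) {
    if (N != M) {
        return false;
    }
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(a[i]) !=
            static_cast<unsigned char>(b[i])) {
            return false;
        }
    }
    return true;
}

// Assumption: an ASCII-compatible execution character set. On an
// EBCDIC target this assertion would be expected to fail.
inline void check_hello_world() {
    assert(same_code_units("hello world", u8"hello world"));
}
```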

For a concrete example, from man gcc:

-fexec-charset=charset

Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.

-finput-charset=charset

Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine.
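To see these options make a difference, a non-ASCII character is needed. The sketch below uses the universal-character-name \u00e9 (é), so the result does not depend on the encoding of the source file. The names are hypothetical. With GCC's default execution character set both literals should contain two bytes. When compiled with -fexec-charset=ISO-8859-1, the ordinary literal would be expected to shrink to the single byte 0xE9, while the u8 literal stays 0xC3 0xA9.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Number of code units in each literal, not counting the terminating NUL.
// ordinary_len depends on the execution character set; utf8_len does not.
constexpr std::size_t ordinary_len = sizeof("\u00e9") - 1;
constexpr std::size_t utf8_len = sizeof(u8"\u00e9") - 1;

// Code unit i of the UTF-8 literal as an unsigned byte value.
constexpr unsigned char utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00e9"[i]);
}

// UTF-8 encodes U+00E9 as the two bytes 0xC3 0xA9 on every implementation.
static_assert(utf8_len == 2, "U+00E9 takes two bytes in UTF-8");

// Prints both lengths so the effect of -fexec-charset can be observed.
inline void print_lengths() {
    assert(utf8_unit(0) == 0xC3 && utf8_unit(1) == 0xA9);
    std::printf("ordinary: %zu byte(s), u8: %zu byte(s)\n",
                ordinary_len, utf8_len);
}
```

Calling print_lengths() from main in two builds, one with default options and one with -fexec-charset=ISO-8859-1, should show the ordinary literal's length change while the u8 length stays at 2.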