
Printing universal characters

Can anyone explain why universal character names in string literals (e.g. "\u00b1") are being encoded into char strings as UTF-8? Why does the following print the plus/minus symbol?

#include <iostream>
#include <cstring>
int main()
{
  std::cout << "\u00b1" << std::endl;
  return 0;
}

Is this related to my current locale?

2.13.2. [...]

5/ A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding. [Note: in translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained.]

and

2.2. [...] The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.

In short, the answer to your question is in your compiler documentation. However:

2.2. 2/ The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed.

so you are guaranteed that the character you name is translated into an implementation-defined encoding, possibly locale-specific.

\u00b1 is the ± symbol, as that is the correct Unicode representation regardless of locale.

Your code at ideone, see here.

String literals, e.g. "abcdef", are simple byte arrays (of type const char[]). The compiler encodes non-ASCII characters in them into something that is implementation-defined. Rumors say Visual C++ uses the current Windows ANSI code page, and GCC uses UTF-8, so you're probably on GCC :)

So, \uABCD is interpreted by the compiler at compile time and converted into the corresponding value in that encoding. I.e. it can put one or more bytes into the byte array:

sizeof("\uFE58z") == 3 // visual C++ 2010
sizeof("\uFE58z") == 5 // gcc 4.4 mingw

And yet, how cout will print the byte array depends on locale settings. You can change the stream's locale via a std::ios_base::imbue() call.

C++ Character Sets

With the standardization of C++, it's useful to review some of the mechanisms included in the language for dealing with character sets. This might seem like a very simple issue, but there are some complexities to contend with.

The first idea to consider is the notion of a "basic source character set" in C++. This is defined to be:

    all ASCII printing characters 041 - 0177, save for @ $ ` DEL

    space

    horizontal tab

    vertical tab

    form feed

    newline

or 96 characters in all. These are the characters used to compose a C++ source program.

Some national character sets, such as the European ISO-646 one, use some of these character positions for other letters. The ASCII characters so affected are:

    [ ] { } | \

To get around this problem, C++ defines trigraph sequences that can be used to represent these characters:

    [       ??(

    ]       ??)

    {       ??<

    }       ??>

    |       ??!

    \       ??/

    #       ??=

    ^       ??'

    ~       ??-

Trigraph sequences are mapped to the corresponding basic source character early in the compilation process.

C++ also has the notion of "alternative tokens", which can be used in place of certain tokens. The list of tokens and their alternatives is this:

    {       <%

    }       %>

    [       <:

    ]       :>

    #       %:

    ##      %:%:

    &&      and

    |       bitor

    ||      or

    ^       xor

    ~       compl

    &       bitand

    &=      and_eq

    |=      or_eq

    ^=      xor_eq

    !       not

    !=      not_eq

Another idea is the "basic execution character set". This includes all of the basic source character set, plus control characters for alert, backspace, carriage return, and null. The "execution character set" is the basic execution character set plus additional implementation-defined characters. The idea is that a source character set is used to define a C++ program itself, while an execution character set is used when a C++ application is executing.

Given this notion, it's possible to manipulate additional characters in a running program, for example characters from Cyrillic or Greek. Character constants can be expressed using any of:

    \137            octal

    \xabcd          hexadecimal

    \U12345678      universal character name (ISO/IEC 10646)

    \u1234          -> \U00001234

This notation uses the source character set to define execution set characters. Universal character names can be used in identifiers (if letters) and in character literals:

    '\u1234'

    L'\u2345'

The above features may not yet exist in your local C++ compiler. They are important to consider when developing internationalized applications.
