
How does a C++ compiler convert escape sequences to actual bytes?

GCC offers a compiler option (-fexec-charset=option) that lets you configure the encoding of your char and string literals: the compiler converts each string from the source charset (UTF-8 by default) to the execution charset.

So I want to know: is it this conversion from the source charset to the execution charset that causes escape sequences to be replaced by their corresponding code points?

Example:

cout << "hello \x60 "; // \x60 replaced by byte 0x60
cout << "hello \n"; // \n replaced by byte 0x0A

Also, in the first example the character \x60 is encoding-independent, whereas in the second example the byte representation of '\n' is encoding-dependent and also platform-dependent (it changes to \r\n on Windows and remains \n on UNIX).

Though you apparently don't quite realize it, you're really asking about two entirely separate conversions.

The first one is converting escape sequences in the compiler. That's pretty straightforward: when it sees a \ in (for example) a string, it looks at the next character and produces a single byte of output for the two (or, depending on the exact input, it might be one byte of output from more than two characters of input, such as something like \001).
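To make that concrete, here is a minimal sketch that checks, at compile time, that each escape sequence ends up as a single char in the literal. The helper first_byte_is is a made-up name for illustration, not anything from a library.

```cpp
#include <cassert>
#include <cstring>

// sizeof counts the terminating '\0', so a literal holding one char has size 2.
static_assert(sizeof("\x60") == 2, "\\x60 collapses to one char");
static_assert(sizeof("\001") == 2, "\\001 collapses to one char");
static_assert(sizeof("\n") == 2, "\\n collapses to one char");

// Numeric escapes carry their value directly into the compiled literal.
static_assert('\x60' == 0x60, "hex escape keeps its numeric value");
static_assert('\001' == 1, "octal escape keeps its numeric value");

// Hypothetical helper: true when the first byte of the literal equals expected.
bool first_byte_is(const char* literal, unsigned char expected) {
    assert(literal != nullptr);
    return static_cast<unsigned char>(literal[0]) == expected;
}
```

If the file compiles, the static_assert lines have already passed, which shows the replacement happened inside the compiler and not at run time.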

The conversion from \n to \r\n on Windows is entirely separate: that happens during output to a stream, specifically a text-mode stream. That conversion isn't done by the compiler proper at all, but by code in the iostreams library.
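A rough way to see this for yourself, assuming you can write a scratch file in the current directory (the function name and the file name you pass in are made up for illustration):

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Writes the one-char literal "\n" to path, then returns the file size in bytes.
// Returns -1 if the file could not be read back.
std::streamoff newline_size_on_disk(const std::string& path, bool binary) {
    assert(!path.empty());
    std::ios::openmode mode = std::ios::out | std::ios::trunc;
    if (binary) mode |= std::ios::binary;
    {
        std::ofstream out(path, mode);
        out << "\n";
    }   // leaving the scope closes the stream and flushes it to disk

    // Reopen in binary mode, positioned at the end, so tellg() is the raw size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    std::streamoff size = in.tellg();
    in.close();
    std::remove(path.c_str());   // clean up the scratch file
    return size;
}
```

On Unix both modes should report 1 byte. On Windows the text-mode call should report 2 (0D 0A) and the binary-mode call 1, even though the compiled literal is identical in both cases.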

In case you really care about the first one, here's some code I wrote years ago that does roughly the same thing as a compiler does (though despite the C++ tag, this code is pure C):

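#include <ctype.h>
#include <stdio.h>
#include <string.h>
/* The original also included a local header, "snip_str.h"; nothing below needs it. */

/* Translate the escape sequences in string in place and return it. */
char *translate(char *string)
{
      char *here=string;
      size_t len=strlen(string);
      unsigned num;
      int numlen;       /* characters consumed after the backslash */

      while (NULL!=(here=strchr(here,'\\')))
      {
            numlen=1;
            switch (here[1])
            {
            case '\0':  /* lone backslash at the end: leave it alone */
                  return string;

            case 'r':
                  *here = '\r';
                  break;

            case 'n':
                  *here = '\n';
                  break;

            case 't':
                  *here = '\t';
                  break;

            case 'v':
                  *here = '\v';
                  break;

            case 'a':
                  *here = '\a';
                  break;

            case 'b':
                  *here = '\b';
                  break;

            case 'f':
                  *here = '\f';
                  break;

            case '0':
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7':
                  /* up to three octal digits; %n stores how many were read */
                  sscanf(here+1,"%3o%n",&num,&numlen);
                  *here = (char)num;
                  break;

            case 'x':
                  /* the hex digits start after the 'x' */
                  if (isxdigit((unsigned char)here[2]) &&
                      1==sscanf(here+2,"%x%n",&num,&numlen))
                  {
                        numlen++;   /* count the 'x' too */
                        *here = (char)num;
                        break;
                  }
                  numlen=1;
                  /* no digits: fall through and keep the 'x' */

            default:    /* \\ \" \' \? and anything else: keep the character */
                  *here = here[1];
                  break;
            }
            here++;
            /* close the gap; the count includes the terminating '\0' */
            memmove(here,here+numlen,len-(size_t)(here-string)-numlen+1);
            len-=numlen;
      }
      return string;
}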

After searching the web, I now know the answer to my question, so I will try to explain it for anyone who is wondering how escape sequences are handled in C++.

When you write your code in a file, you choose the file's charset (Windows-1252, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE ...). The charset maps the characters in your file to their corresponding code points, which are then encoded as a stream of bytes to be saved on the hard drive.
When you compile your source file without specifying its encoding via the -finput-charset=option compiler option, the compiler assumes the file is encoded in UTF-8. In both cases, the first thing the C preprocessor (CPP) does is convert your file into the source charset, which is UTF-8.

After the CPP is done, string and character constants are converted again, this time to the execution charset. By default it matches the source charset, UTF-8, but you can change it with the -fexec-charset=option compiler option. So far everything is clear, and we haven't talked about escape sequences, since they are handled differently.

There are two kinds of escape sequences, and each is handled differently when the string is converted from the source charset to the execution charset. The first kind is octal or hexadecimal escape sequences such as \xA1 or \45; the second kind is escape sequences written as a backslash followed by a character, such as \r or \n.

Octal and hexadecimal escape sequence values are independent of the execution charset, which means they are not converted from the source charset to the execution charset. For example, \xA1 has the value A1 regardless of the current execution charset.
The values of the remaining escape sequences depend on the execution charset. For example, '\n' is first mapped to the corresponding character in the source charset, which is 0A in UTF-8, and then converted to the execution charset. So if the user has set -fexec-charset=UTF-16BE, '\n' is 0A in the source charset and 00 0A after the source-to-execution conversion.
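One way to check this on your own compiler is to dump the bytes of a few literals. This is only a sketch: hex_bytes and LITERAL_HEX are names I made up for illustration, and the expectations in the note after the code assume GCC with an ASCII-compatible execution charset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the first n bytes of s as lowercase hex pairs
// separated by single spaces, e.g. "68 69".
std::string hex_bytes(const char* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < n; ++i) {
        unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// sizeof(lit) - 1 drops the terminating '\0' that the compiler appends,
// so the macro dumps exactly the bytes stored for the literal.
#define LITERAL_HEX(lit) hex_bytes(lit, sizeof(lit) - 1)
```

LITERAL_HEX("\xE9") should give e9 whatever the execution charset is, because hex escapes are not converted. By contrast, LITERAL_HEX("\u00e9") names a character rather than a byte, so I would expect c3 a9 with the default UTF-8 execution charset and e9 when compiling with -fexec-charset=ISO-8859-1.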

The line feed escape character \n is even platform-dependent: on Windows the output library replaces \n (0A) with \r\n (0D 0A), while on Unix it stays \n (0A). Note that this replacement happens after characters and strings have been converted from the source charset to the execution charset; otherwise we would get a different result.
