
Implementation of string literal concatenation in C and C++

AFAIK, this question applies equally to C and C++.

Step 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals have to be concatenated into a single literal. I.e.:

printf("helloworld.c" ": %d: Hello "
       "world\n", 10);

is equivalent (syntactically) to:

printf("helloworld.c: %d: Hello world\n", 10);
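
The equivalence can be checked directly. The following is only a small sketch: the helper name same_after_concatenation is made up for illustration, and the two arrays simply hold the split and the single spelling of the format string from above.

```c
#include <assert.h>
#include <string.h>

/* The format string from the question, spelled as three adjacent literals. */
static const char split[] = "helloworld.c" ": %d: Hello "
                            "world\n";

/* The same format string, spelled as one literal. */
static const char joined[] = "helloworld.c: %d: Hello world\n";

/* Hypothetical helper: returns 1 when both spellings yield identical bytes. */
int same_after_concatenation(void)
{
    /* Same size, including the terminating '\0'. */
    assert(sizeof split == sizeof joined);
    return memcmp(split, joined, sizeof joined) == 0;
}
```

Calling same_after_concatenation() from any test program should return 1 on a conforming implementation.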

However, the standard doesn't seem to specify which part of the implementation has to handle this: should it be the preprocessor (cpp) or the compiler itself? Some online research tells me that this function is generally expected to be performed by the preprocessor (source #1, source #2, and there are more), which makes sense.

However, running cpp in Linux shows that cpp doesn't do it:

eliben@eliben-desktop:~/test$ cat cpptest.c 
int a = 5;

"string 1" "string 2"
"string 3"

eliben@eliben-desktop:~/test$ cpp cpptest.c 
# 1 "cpptest.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cpptest.c"
int a = 5;

"string 1" "string 2"
"string 3"

So, my question is: where should this feature of the language be handled, in the preprocessor or the compiler itself?

Perhaps there's no single good answer. Heuristic answers based on experience, known compilers, and general good engineering practice will be appreciated.


P.S. If you're wondering why I care about this: I'm trying to figure out whether my Python-based C parser should handle string literal concatenation (which it doesn't do at the moment) or leave it to cpp, which it assumes runs before it.

The standard doesn't specify a preprocessor vs. a compiler; it just specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, phases 5 through 7 in the compiler, and phase 8 in the linker, but none of that is required by the standard.

Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.

Edit:

Your "I.e." link at the beginning of the post answers the question:

Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time ...

In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):

5.1.1.2 Translation phases
...

4. Preprocessing directives are executed and macro invocations are expanded. ...

5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.

6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
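
Because macro expansion (phase 4) comes before concatenation (phase 6), a string literal produced by a macro is joined with its neighbors like any other. A minimal sketch of that ordering follows; the macro SOURCE_NAME and the helper macro_literal_is_concatenated are names invented for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro, expanded in phase 4 to a string literal token. */
#define SOURCE_NAME "helloworld.c"

/* After expansion the two literals are adjacent, so phase 6 joins them. */
static const char prefix[] = SOURCE_NAME ": %d: Hello";

/* Hypothetical helper: returns 1 when the macro-built string equals the
   hand-written single literal. */
int macro_literal_is_concatenated(void)
{
    assert(sizeof prefix == sizeof "helloworld.c: %d: Hello");
    return strcmp(prefix, "helloworld.c: %d: Hello") == 0;
}
```

This is also why a standalone preprocessor can leave the literals untouched: it only has to deliver the expanded tokens, and whatever implements phase 6 does the joining.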

The standard does not require that an implementation be split into a preprocessor and a compiler, per se.

Step 4 is clearly a preprocessor responsibility.

Step 5 requires that the "execution character set" be known. This information is also required by the compiler. It is easier to port the compiler to a new platform if the preprocessor does not contain platform dependencies, so the tendency is to implement step 5, and thus step 6, in the compiler.

There are tricky rules for how string literal concatenation interacts with escape sequences. Suppose you have

const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";

then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. (This is what Heath is getting at in quoting the translation steps: escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."
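
Both rules can be observed in a short test. This is a sketch under two assumptions: an ASCII-compatible execution character set (so that octal 154 is 'l'), and a C99-or-later compiler (mixing a narrow and a wide literal was undefined in C90). The helper names are invented for illustration.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

static const char x1[] = "a\15" "4";  /* "\15" is decoded before joining */
static const char y1[] = "a\154";     /* a single octal escape */
static const char x2[] = "a\r4";
static const char y2[] = "al";

/* Hypothetical helper: returns 1 when escapes were converted (phase 5)
   before the literals were concatenated (phase 6). */
int escapes_convert_before_concatenation(void)
{
    assert(sizeof x1 == 4);  /* 'a', '\r', '4', '\0' */
    assert(sizeof y1 == 3);  /* 'a', one escaped character, '\0' */
    /* The y1/y2 comparison assumes ASCII, where octal 154 is 'l'. */
    return strcmp(x1, x2) == 0 && strcmp(y1, y2) == 0;
}

/* Hypothetical helper: returns 1 when a group containing one L-prefixed
   literal becomes a wide string as a whole (C99 and later). */
int wide_prefix_wins(void)
{
    return wcscmp(L"ab" "cd", L"abcd") == 0;
}
```

A tool that pasted the source spellings together before decoding escapes would turn x1 into "a\154" and fail the first check.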

I would handle it in the token-scanning part of the parser, so in the compiler. It seems more logical. The preprocessor does not have to know the "structure" of the language, and in fact it usually ignores it, which is why macros can generate uncompilable code. It handles nothing more than what it is entitled to handle by the directives specifically addressed to it (# ...) and their "consequences" (like those of a #define x h, which would make the preprocessor change a lot of x into h).
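
For a hand-written parser, handling it at the token level could look roughly like the following. This is only a sketch, not how any particular compiler does it: struct str_token and merge_adjacent are hypothetical names, and the tokens are assumed to hold already-decoded contents (escapes converted), so that joining them cannot change the meaning of an escape sequence.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token: contents of one string literal after escape decoding.
   The length is stored explicitly because decoded text may contain '\0'. */
struct str_token {
    const char *value;
    size_t len;
};

/* Joins n adjacent string tokens into one malloc'd, NUL-terminated buffer.
   Stores the joined length in *out_len if out_len is not NULL.
   Returns NULL if allocation fails; the caller frees the result. */
char *merge_adjacent(const struct str_token *toks, size_t n, size_t *out_len)
{
    size_t total = 0, pos = 0, i;
    char *buf;

    assert(toks != NULL || n == 0);

    for (i = 0; i < n; i++)
        total += toks[i].len;

    buf = malloc(total + 1);
    if (buf == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        memcpy(buf + pos, toks[i].value, toks[i].len);
        pos += toks[i].len;
    }
    buf[total] = '\0';

    if (out_len != NULL)
        *out_len = total;
    return buf;
}
```

The scanner would call merge_adjacent whenever it sees a run of consecutive string-literal tokens and hand the parser a single token in their place.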
