
How Do C++ Compilers Merge Identical String Literals

How does the compiler (MS Visual C++ 2010) combine identical string literals in different cpp source files? For example, suppose the string literal "hello world\n" appears in both src1.cpp and src2.cpp. The compiled exe file will probably contain only one "hello world\n" string literal, in the constant/read-only section. Is this task done by the linker?

What I hope to achieve: I have some modules written in assembly that are used by C++ modules, and these assembly modules contain many long string literal definitions. I know those string literals are identical to some string literals in the C++ source. If I link my assembler-generated obj code with the compiler-generated obj code, will the linker merge these string literals to remove the redundant copies, as it does when all modules are written in C++?

(Note: the following applies only to MSVC.)

My first answer was misleading, since I thought that literal merging was magic done by the linker (and so that the /GF flag would only be needed by the linker).

However, that was a mistake. It turns out the linker has little special involvement in merging string literals. What happens is that when the /GF option is given to the compiler, it puts string literals in a "COMDAT" section of the object file with an object name that's based on the contents of the string literal. So the /GF flag is needed for the compile step, not for the link step.

When you use the /GF option, the compiler places each string literal in the object file in a separate section as a COMDAT object. The various COMDAT objects with the same name will be folded by the linker (I'm not exactly sure about the semantics of COMDAT, or what the linker might do if objects with the same name have different data). So a C file that contains

char* another_string = "this is a string";

will have something like the following in the object file:

SECTION HEADER #3
  .rdata name
       0 physical address
       0 virtual address
      11 size of raw data
     147 file pointer to raw data (00000147 to 00000157)
       0 file pointer to relocation table
       0 file pointer to line numbers
       0 number of relocations
       0 number of line numbers
40301040 flags
         Initialized Data
         COMDAT; sym= "`string'" (??_C@_0BB@LFDAHJNG@this?5is?5a?5string?$AA@)
         4 byte align
         Read Only

RAW DATA #3
  00000000: 74 68 69 73 20 69 73 20 61 20 73 74 72 69 6E 67  this is a string
  00000010: 00      

with the relocation table wiring up the another_string variable name to the literal data.
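As a rough mental model of the folding described above, here is a minimal C++ sketch. It assumes the simplest possible behaviour (keep the first section seen for each symbol name and ignore later ones); `ComdatSection` and `fold_comdats` are hypothetical names, not part of any real linker or of the COFF format, and a real linker applies further selection rules.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for one COMDAT section in an object file.
// The field names are illustrative only.
struct ComdatSection {
    std::string symbol;  // e.g. the mangled name derived from the literal
    std::string data;    // raw bytes of the section
};

// Fold sections that share a symbol name, keeping the first one seen.
// Assumption: "pick any" behaviour; nothing checks that the data matches.
std::map<std::string, std::string>
fold_comdats(const std::vector<ComdatSection>& sections) {
    std::map<std::string, std::string> kept;
    for (const ComdatSection& s : sections) {
        kept.emplace(s.symbol, s.data);  // emplace leaves an existing key untouched
    }
    return kept;
}
```

Under this model, two object files that each contribute a section named after the same literal end up with a single copy in the output, which is why the name has to be derived from the string contents.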

Note that the name of the string literal object is clearly based on the contents of the literal string, but with some sort of mangling. The mangling scheme has been partially documented on Wikipedia (see "String constants").
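To make the relationship between the name and the contents concrete, here is a small C++ sketch that decodes the two readable pieces of the example name shown earlier: the `BB` length field and the `this?5is?5a?5string?$AA` character field. It is based only on my reading of that one example (letters `A` to `P` standing for hex nibbles, `?5` for a space, `?$XY` for an arbitrary byte), so treat it as an assumption rather than a full description of the scheme. The function names are made up for illustration, and the hash field (`LFDAHJNG`) is not handled at all.

```cpp
#include <cstddef>
#include <string>

// Convert one letter in the range 'A'..'P' to the nibble 0..15.
// Returns -1 for anything else.
int nibble_from_letter(char c) {
    return (c >= 'A' && c <= 'P') ? c - 'A' : -1;
}

// Decode a length field written as letters 'A'..'P', e.g. "BB".
// Returns -1 if a character falls outside that range.
int decode_length_field(const std::string& field) {
    int value = 0;
    for (char c : field) {
        int n = nibble_from_letter(c);
        if (n < 0) return -1;
        value = value * 16 + n;
    }
    return value;
}

// Decode the character portion of the mangled name.
// Handles only plain characters, "?5" (space) and "?$XY" (one byte);
// any other escape is copied through unchanged.
std::string decode_literal_chars(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '?' && i + 1 < in.size() && in[i + 1] == '5') {
            out.push_back(' ');
            i += 1;
        } else if (in[i] == '?' && i + 3 < in.size() && in[i + 1] == '$' &&
                   nibble_from_letter(in[i + 2]) >= 0 &&
                   nibble_from_letter(in[i + 3]) >= 0) {
            int hi = nibble_from_letter(in[i + 2]);
            int lo = nibble_from_letter(in[i + 3]);
            out.push_back(static_cast<char>(hi * 16 + lo));
            i += 3;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

For the example name, the length field decodes to 17 (hex 11, matching the "size of raw data" in the dump), and the character field decodes to "this is a string" followed by the terminating NUL byte.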

Anyway, if you want literals in an assembly file to be treated in the same manner, you'd need to arrange for the literals to be placed in the object file in the same manner. I honestly don't know what (if any) mechanism the assembler might have for that. Placing an object in a "COMDAT" section is probably pretty easy. Getting the name of the object to be based on the string contents (and mangled in the appropriate manner) is another story.

Unless there's some assembly directive/keyword that specifically supports this scenario, I think you might be out of luck. There certainly might be one, but I'm sufficiently rusty with ml.exe to have no idea, and a quick look at the skimpy MSDN docs for ml.exe didn't turn up anything obvious.

However, if you're willing to put the string literals in a C file and refer to them in your assembly code via externs, it should work. That's essentially what Mark Ransom advocates in his comments to the question.
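A minimal sketch of that approach, written here as a C++ source file since the rest of the project is C++. The file name and the symbol names `shared_greeting` and `shared_banner` are hypothetical. The strings are defined once as named arrays with C linkage, so the assembly module can refer to them by a predictable, unmangled name through an external declaration.

```cpp
// shared_strings.cpp (hypothetical file name)
//
// extern "C" keeps the symbol names unmangled. The explicit 'extern'
// declarations give the const arrays external linkage; without them a
// const object at namespace scope would be private to this file in C++.
extern "C" {
extern const char shared_greeting[];
extern const char shared_banner[];

// Single definitions shared by the C++ and assembly modules.
const char shared_greeting[] = "hello world\n";
const char shared_banner[] = "this is a string";
}
```

With this layout there is exactly one copy of each string regardless of whether /GF is used, because both the C++ code and the assembly code name the same object instead of each supplying its own literal.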

Yes, the process of merging the resources is done by the linker.

If the resources in your compiled assembly code are properly tagged as resources, the linker will be able to merge them with compiled C code.

Much may depend on the specific compiler, linker, and how you drive them. For example, this code:

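// s.c
#include <stdio.h>

void f(void);

int main(void) {
    /* %p expects a void pointer, so cast the literal's address */
    printf( "%p\n", (void *)"foo" );
    printf( "%p\n", (void *)"foo" );
    f();
    return 0;
}

// s2.c
#include <stdio.h>

void f(void) {
    printf( "%p\n", (void *)"foo" );
    printf( "%p\n", (void *)"foo" );
}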

when compiled as:

gcc s.c s2.c

produces:

00403024
00403024
0040302C
0040302C

from which you can see the strings have only been coalesced within individual translation units.

Identical literals within the same translation unit are processed during the parsing phase. The compiler converts literals into tokens and stores them in a table (for simplicity, assume [token ID, value]). When the compiler encounters the literal the first time, the value is entered into the table. Later encounters reuse the same entry. When generating code, this value is placed into memory and then each access reads this single value (except for those cases where placing the value in the executable code more than once speeds up execution or shortens executable length).
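The table described above can be sketched as a small interning pool. This is only an illustration of the idea, not how any particular compiler implements it; `LiteralPool` and its member names are invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// A toy [token ID, value] table: each distinct literal value is stored once.
class LiteralPool {
public:
    // Return the ID for `value`, adding it to the table on first encounter.
    std::size_t intern(const std::string& value) {
        auto it = ids_.find(value);
        if (it != ids_.end()) {
            return it->second;  // seen before: reuse the existing entry
        }
        std::size_t id = values_.size();
        values_.push_back(value);
        ids_.emplace(value, id);
        return id;
    }

    // Number of distinct literals stored so far.
    std::size_t size() const { return values_.size(); }

    // Look up the stored value for an ID (throws std::out_of_range if invalid).
    const std::string& value(std::size_t id) const { return values_.at(id); }

private:
    std::vector<std::string> values_;                   // ID -> value
    std::unordered_map<std::string, std::size_t> ids_;  // value -> ID
};
```

Interning "foo" twice returns the same ID both times and leaves a single entry in the pool, which corresponds to the two identical addresses printed per translation unit in the gcc example.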

Duplicate literals in more than one translation unit may be consolidated by the linker. All identifiers tagged with global access (i.e. visible from outside the translation unit) will be consolidated if possible. That means that the code will access only one version of the symbol.

Some build projects place common or global identifiers into (resource) tables, which allow the identifiers to change without changing the executable. This is a common practice for GUIs that need to present text translated into different languages.

Be aware that some compilers and linkers may not perform the consolidation by default. Some may require a command line switch (or an option). Check your compiler documentation to see how it handles duplicate identifiers or text strings.

"/GF (Eliminate Duplicate Strings)"

http://msdn.microsoft.com/en-us/library/s0s0asdt.aspx
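Since merging depends on the compiler and its options, one way to see what a given build actually does is to compare the addresses of two identical literals at run time. The following is a small sketch with made-up function names. Whether the two literals share storage is implementation-dependent, so the code only reports the result and nothing should rely on it.

```cpp
#include <cstdio>

// Returns true if both pointers refer to the same storage.
bool same_storage(const char* a, const char* b) {
    return a == b;
}

// Print the addresses of two identical literals and report whether this
// build merged them. The answer may differ between compilers and options.
bool report_literal_merging() {
    const char* first = "foo";
    const char* second = "foo";
    bool merged = same_storage(first, second);
    std::printf("%p %p -> %s\n",
                static_cast<const void*>(first),
                static_cast<const void*>(second),
                merged ? "merged" : "separate");
    return merged;
}
```

Calling `report_literal_merging()` from builds with and without the relevant switch is a quick way to confirm its effect within one translation unit; checking across translation units needs the two literals to live in different source files, as in the s.c / s2.c example.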

Assembly language doesn't provide any way to work directly with an anonymous string literal the way C or C++ does.

As such, what you almost certainly want to do is define the strings in your assembly code with names. To use those from C or C++, put an extern declaration of each array into a header that you can #include in whatever files need access to them (and in your C++ code, you'll use the names, not the literals themselves):

foo.asm:

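.386
.model flat, c

public string1, string2   ; make the labels visible to other modules

.data
    string1 db "This is the first string", 10, 0
    string2 db "This is the second string", 10, 0   ; 10 is the newline; "\n" is not an escape in MASM

end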

foo.h:

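#ifdef __cplusplus
extern "C" char string1[], string2[];  /* C linkage so the names match the assembly symbols */
#else
extern char string1[], string2[];
#endif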

bar.cpp:

#include <iostream>
#include "foo.h"

void baz() { std::cout << string1; }
