
Assigning a string literal to a char array, how is the string literal copied onto the stack?

I understand that when you write char array[] = "string", the string literal "string" gets copied from the data segment to the stack. Is the string literal copied over character by character? Or does the compiler get the start and end addresses of the string literal and copy the entire string to the stack at once?

Thanks.

The compiler does anything it “wants” to do, as long as the observed result is the same. Sometimes there is no copy at all.

The C standard does not specify how the copy is done, so a C implementation is free to achieve the result by any means. The only requirement the C standard imposes is that the observable results (such as text written to standard output) must be as defined.

When engineers design a high-quality C implementation, they spend time considering the best ways to copy a string in such a situation, and they try to design a compiler that chooses the best way in each case. A short string might be built in place by using “move immediate value” instructions. A long string might be copied by a call to memcpy. An intermediate string might be copied by an inlined memcpy, effectively a few instructions that move several bytes each.

When engineers design a cheap C implementation, something that just gets the job done so that code can be ported to a machine but does not need to be fast, they will do whatever is easiest for them.

Sometimes a compiler will not copy the string at all: if the compiler can tell that you do not need a copy, there is no reason to make one. For example, if the compiler sees that you merely pass the string to printf and do not modify it at all, then it can get the same result without making a copy by passing the original to printf.

I'm not sure what you mean by your distinction between "character by character" and "entire string" copying methods. A string is typically not a machine-level entity, which means that there's no possibility of it being copied as an "entire string". How do you expect this to happen?

A string will always be copied "character by character", at least conceptually. When it comes to copying larger memory regions, the compiler can optimize the copy by moving data word by word (instead of byte by byte) whenever possible. A similar optimization might be implemented at the processor micro-architecture level.

In any case, in the general case the copying is implemented as an iterative process, not as some atomic operation on the "entire string".

On top of that, a smart compiler might realize that in some cases the copying is not necessary at all. For example, if your code does not modify the array object and does not rely on its address identity, the compiler might simply decide to use the original string literal directly, without doing any copying at all (i.e. it quietly replaces your char array[] = "string" with const char *array = "string").

There's no reason to think there's a copy at all.

Take the following code for example.

int main() {
  char c[] = "hi";
}

For me this produces the following (unoptimized) assembly:

main:
    pushq   %rbp
    movq    %rsp, %rbp
    movw    $26984, -16(%rbp)
    movb    $0, -14(%rbp)
    movl    $0, %eax
    popq    %rbp
    ret

The array's memory is initialized by storing the 16-bit value 26984 (0x6968), followed by a separate store of the terminating zero byte. On this little-endian target, 26984 is laid out in memory as the two bytes 0x68 and 0x69, which are the ASCII values of 'h' and 'i'. There is no data segment representation of the string at all, and the array is not initialized by copying anything into it character by character, or by any other clever method of copying.

Of course this is only one compiler's implementation (g++ 4.8), and other compilers can do whatever they want as long as they conform to the language specification.

This depends on the compiler and target architecture.

There could be very simple target architectures, like microcontrollers, which don't have instructions to support copying blocks of memory. There probably also exist very simple compilers designed for teaching, which generate byte-by-byte copying even on architectures that support more efficient methods.

However, you can assume that production-level compilers will do the reasonable thing and produce fast code for most popular architectures in this case, so you don't really need to worry about it.

Still, the best way to check is to read the assembly the compiler generates.

Take this test code (stack_array_init.c):

#include <stdio.h>

int
main()
{
    char a[]="Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed\n"
             "do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n";

    printf("%s", a);

    return 0;
}

Then compile it to assembly with gcc, optimizing for size (so there is less to read), like this:

gcc -Os -S stack_array_init.c

Here is the output for x86-64:

        .file   "stack_array_init.c"
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC1:
        .string "%s"
.LC0:
        .string "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed\ndo eiusmod tempor incididunt ut labore et dolore magna aliqua.\n"
        .section        .text.startup,"ax",@progbits
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        subq    $136, %rsp
        .cfi_def_cfa_offset 144
        movl    $.LC0, %esi
        movl    $126, %ecx
        leaq    2(%rsp), %rdi
        xorl    %eax, %eax
        rep movsb
        leaq    2(%rsp), %rsi
        movl    $.LC1, %edi
        call    printf
        xorl    %eax, %eax
        addq    $136, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Debian 4.7.2-5) 4.7.2"
        .section        .note.GNU-stack,"",@progbits

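Here, "rep movsb" is the instruction that copies the string to the stack: it moves the 126 bytes counted in %ecx (125 characters plus the terminating zero) from the address in %rsi (.LC0) to the address in %rdi (the stack array).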

Here is an excerpt from the ARMv4 assembly (which might be easier to read):

main:
    @ Function supports interworking.
    @ args = 0, pretend = 0, frame = 128
    @ frame_needed = 0, uses_anonymous_args = 0
    str lr, [sp, #-4]!
    sub sp, sp, #132
    mov r2, #126
    ldr r1, .L2
    mov r0, sp
    bl  memcpy
    mov r1, sp
    ldr r0, .L2+4
    bl  printf
    mov r0, #0
    add sp, sp, #132
    ldr lr, [sp], #4
    bx  lr
.L3:
    .align  2
.L2:
    .word   .LC0
    .word   .LC1
    .size   main, .-main
    .section    .rodata.str1.4,"aMS",%progbits,1
    .align  2
.LC1:
    .ascii  "%s\000"
    .space  1
.LC0:
    .ascii  "Lorem ipsum dolor sit amet, consectetur adipisicing"
    .ascii  " elit, sed\012do eiusmod tempor incididunt ut labor"
    .ascii  "e et dolore magna aliqua.\012\000"
    .ident  "GCC: (Debian 4.6.3-14) 4.6.3"
    .section    .note.GNU-stack,"",%progbits

As far as I understand ARM assembly, this code calls memcpy to copy the string into the stack array. Although this doesn't show the assembly for memcpy, I would expect it to use one of the fastest methods available.
