
Difference between for loops in the assembly code generated by old and new versions of GCC

I am reading a chapter about assembly code, which has an example. Here is the C program:

#include <stdio.h>

int main()
{
    int i;
    for(i=0; i < 10; i++)
    {
        puts("Hello, world!\n");
    }
    return 0;
}

Here is the assembly code provided in the book:

0x08048384 <main+0>:    push ebp
0x08048385 <main+1>:    mov ebp,esp
0x08048387 <main+3>:    sub esp,0x8
0x0804838a <main+6>:    and esp,0xfffffff0
0x0804838d <main+9>:    mov eax,0x0
0x08048392 <main+14>:   sub esp,eax
0x08048394 <main+16>:   mov DWORD PTR [ebp-4],0x0
0x0804839b <main+23>:   cmp DWORD PTR [ebp-4],0x9
0x0804839f <main+27>:   jle 0x80483a3 <main+31>
0x080483a1 <main+29>:   jmp 0x80483b6 <main+50>
0x080483a3 <main+31>:   mov DWORD PTR [esp],0x80484d4
0x080483aa <main+38>:   call 0x80482a8 <_init+56>
0x080483af <main+43>:   lea eax,[ebp-4]
0x080483b2 <main+46>:   inc DWORD PTR [eax]
0x080483b4 <main+48>:   jmp 0x804839b <main+23>

Here is part of my version:

   0x0000000000400538 <+8>:     mov    DWORD PTR [rbp-0x4],0x0
=> 0x000000000040053f <+15>:    jmp    0x40054f <main+31>
   0x0000000000400541 <+17>:    mov    edi,0x4005f0
   0x0000000000400546 <+22>:    call   0x400410 <puts@plt>
   0x000000000040054b <+27>:    add    DWORD PTR [rbp-0x4],0x1
   0x000000000040054f <+31>:    cmp    DWORD PTR [rbp-0x4],0x9
   0x0000000000400553 <+35>:    jle    0x400541 <main+17>

My question is: why does the book's version assign 0 to the variable (mov DWORD PTR [ebp-4],0x0) and compare immediately afterwards with cmp, whereas my version assigns 0 and then executes jmp 0x40054f <main+31>, which jumps down to where the cmp is?

It seems more logical to assign and then compare without any jump, because that is the order in which the for loop is written.

Why did your compiler do something different from the compiler used in the book? Because it's a different compiler. No two compilers compile all code the same way; even very trivial code can be compiled very differently by two different compilers, or even by two versions of the same compiler. It also looks as though both listings were compiled without optimization (the loop counter lives on the stack in both); with optimization enabled the results would differ even more.

Let's reason about what the for loop does.

for (i = 0; i < 10; i++) {
    code;
}

Let's rewrite it so that it is a little closer to the assembly generated by the first compiler (the book's):

        i = 0;
start:  if (i > 9) goto out;
        code;
        i++;
        goto start;
out:

Now the same thing for "my version":

        i = 0;
        goto cmp;
start:  code;
        i++;
cmp:    if (i < 10) goto start;

The clear difference is that "my version" executes only one jump per iteration (the conditional jump at the bottom), while the book's version executes two (the test at the top and the unconditional jump back). Generating loops this way is quite common in more modern compilers because of how sensitive CPUs are to branches. Many compilers generate code like this even without optimization because it performs better in most cases. Older compilers didn't, either because their authors hadn't thought of it or because the trick was performed in an optimization stage that wasn't enabled when the book's code was compiled.

Note that a compiler with any kind of optimization enabled wouldn't even emit that first goto cmp, because it could prove that i < 10 holds on entry, so the jump is unnecessary. Try compiling your code with optimization enabled (you say you use gcc, so give it the -O2 flag) and see how different the result looks.

You didn't quote the full assembly-language body of the function from your textbook, but my psychic powers tell me that it looked something like this (also, I've replaced literal addresses with labels, for clarity):

    # ... establish stack frame ...

    mov    DWORD PTR [rbp-0x4],0x0
    cmp    DWORD PTR [rbp-0x4],0x9
    jg     .L0                          # skip the loop entirely if i > 9
.L1:
    mov    rdi, .Lconst0
    call   puts
    add    DWORD PTR [rbp-0x4],0x1
    cmp    DWORD PTR [rbp-0x4],0x9
    jle    .L1
.L0:

    # ... return from function ...

GCC has noticed that it can eliminate the initial cmp and conditional jump by replacing them with an unconditional jmp down to the cmp at the bottom of the loop, so that is what it did. This is a standard optimization called loop inversion. Apparently it does this even with the optimizer off; with optimization on, it would also have noticed that the initial test can never skip the loop (i is 0 on entry), hoisted out the address load, placed the loop index in a register, and converted to a count-down loop so it could eliminate the cmp altogether; something like this:

    # ... establish stack frame ...

    mov    ebx, 10
    mov    r14, .Lconst0
.L1:
    mov    rdi, r14
    call   puts
    dec    ebx
    jne    .L1

    # ... return from function ...

(The above was actually generated by Clang. My version of GCC did something else, equally sensible but harder to explain.)

