Complex compiler output for simple constructor

Question

I have a struct X with two 64-bit integer members, and a constructor:

struct X
{
    X(uint64_t a, uint64_t b)
    {
        a_ = a; b_ = b;
    }

    uint64_t a_, b_;
};

When I look at the compiler output (x86-64 gcc 8.3 and x86-64 clang 8.0.0, on 64-bit Linux), with no optimizations enabled, I see the following code for the constructor.

x86-64 gcc 8.3:

X::X(unsigned long, unsigned long):
    push    rbp
    mov     rbp, rsp
    mov     QWORD PTR [rbp-8], rdi
    mov     QWORD PTR [rbp-16], rsi
    mov     QWORD PTR [rbp-24], rdx
    mov     rax, QWORD PTR [rbp-8]
    mov     QWORD PTR [rax], 0
    mov     rax, QWORD PTR [rbp-8]
    mov     QWORD PTR [rax+8], 0
    mov     rax, QWORD PTR [rbp-8]
    mov     rdx, QWORD PTR [rbp-16]
    mov     QWORD PTR [rax+8], rdx
    mov     rax, QWORD PTR [rbp-8]
    mov     rdx, QWORD PTR [rbp-24]
    mov     QWORD PTR [rax], rdx
    nop
    pop     rbp
    ret

x86-64 clang 8.0.0:

X::X(unsigned long, unsigned long):
    push    rbp
    mov     rbp, rsp
    mov     qword ptr [rbp - 8], rdi
    mov     qword ptr [rbp - 16], rsi
    mov     qword ptr [rbp - 24], rdx
    mov     rdx, qword ptr [rbp - 8]
    mov     qword ptr [rdx], 0
    mov     qword ptr [rdx + 8], 0
    mov     rsi, qword ptr [rbp - 16]
    mov     qword ptr [rdx + 8], rsi
    mov     rsi, qword ptr [rbp - 24]
    mov     qword ptr [rdx], rsi
    pop     rbp
    ret

Does anyone know why the output is so complex? I would have expected two simple "mov" statements, even with no optimizations enabled.

Answer 1

Un-optimized code always stores every C++ variable (including function args) to its memory location between statements, so that the values are available for the debugger to read and even modify. (And because the compiler didn't spend any time doing register allocation.) This includes storing register args to memory before the first C++ statement of a function.

This is Intel-syntax assembly, like the output of gcc -masm=intel, so operands are in destination, source order. (We can tell from the PTR keyword, the square brackets, and the lack of % on register names.)

The first 3 stores are the function arguments (this, a, b) that were passed in registers RDI, RSI, and RDX as per the x86-64 System V ABI's calling convention.

mov     QWORD PTR [rbp-8], rdi        # this
mov     QWORD PTR [rbp-16], rsi       # a
mov     QWORD PTR [rbp-24], rdx       # b
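To make the hidden this argument concrete, here is a rough C++ sketch of how the constructor looks from the calling convention's point of view. The free function construct_X and the helper same_result are hypothetical names made up for illustration; they are not something the compiler generates under those names.

```cpp
#include <cstdint>
using std::uint64_t;

struct X
{
    X(uint64_t a, uint64_t b)
    {
        a_ = a; b_ = b;
    }

    uint64_t a_, b_;
};

// Hypothetical free-function equivalent, for illustration only: the implicit
// `this` pointer becomes an explicit first parameter. Under the x86-64
// System V ABI the three arguments arrive in RDI, RSI and RDX respectively.
void construct_X(X* self, uint64_t a, uint64_t b)
{
    self->a_ = a;
    self->b_ = b;
}

// Checks that the explicit-pointer version leaves the object in the same
// state as the real constructor.
bool same_result(uint64_t a, uint64_t b)
{
    X real(a, b);
    X manual(0, 0);
    construct_X(&manual, a, b);
    return real.a_ == manual.a_ && real.b_ == manual.b_;
}
```

Seen this way, it is less surprising that three registers get spilled for a two-argument constructor.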

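Now it loads this into rax and writes zeros into a_ and b_. The constructor as posted assigns in its body instead of using a member initializer list, but that alone would not make the compiler emit zero stores. More likely the code you actually compiled zero-initializes the members somewhere you did not show here, or you used an odd compiler option.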

mov     rax, QWORD PTR [rbp-8]
mov     QWORD PTR [rax], 0           # this->a_ = 0
mov     rax, QWORD PTR [rbp-8]
mov     QWORD PTR [rax+8], 0         # this->b_ = 0
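For reference, a version of the constructor that uses a member initializer list might look like the sketch below. This is a style illustration, not the code from the question, and for plain integer members it is not expected to explain the zero stores either way.

```cpp
#include <cstdint>
using std::uint64_t;

struct X
{
    // Members are initialized directly from the arguments, in declaration
    // order (a_ first, then b_), instead of being assigned in the body.
    X(uint64_t a, uint64_t b) : a_(a), b_(b) {}

    uint64_t a_, b_;
};
```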

Then it loads this into rax again and a into rdx, then writes this->a_ with rdx, aka a. Same again for b.

Wait, actually that has to be a write to b_ first and then a write to a_, because struct members are required to be laid out in memory in declaration order. So [rax+8] has to be b_, not a_.

mov     rax, QWORD PTR [rbp-8]
mov     rdx, QWORD PTR [rbp-16]        # reload a
mov     QWORD PTR [rax+8], rdx         # this->b_ = a
mov     rax, QWORD PTR [rbp-8]
mov     rdx, QWORD PTR [rbp-24]        # reload b
mov     QWORD PTR [rax], rdx           # this->a_ = b

So your asm doesn't match the C++ source in your question.
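As a guess at what source could produce stores like these, here is a hypothetical struct Y. It is an assumption for illustration, not the asker's real code: default member initializers would account for the two zero stores, and swapped assignments would account for a landing at offset 8 and b at offset 0.

```cpp
#include <cstddef>
#include <cstdint>
using std::uint64_t;

// Hypothetical struct, NOT the code from the question.
struct Y
{
    Y(uint64_t a, uint64_t b)
    {
        b_ = a;   // would match: mov QWORD PTR [rax+8], rdx   (rdx = a)
        a_ = b;   // would match: mov QWORD PTR [rax], rdx     (rdx = b)
    }

    // Default member initializers: one possible source of the zero stores.
    uint64_t a_ = 0, b_ = 0;
};

// Members are laid out in declaration order, so a_ sits at offset 0
// and b_ directly after it.
static_assert(offsetof(Y, a_) == 0, "a_ is the first member");
static_assert(offsetof(Y, b_) == 8, "b_ follows a_");
```

The static_assert lines only confirm the layout argument above; whether the original program looked anything like Y is unknown.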

Answer 2

What happens, and why?

If you don't turn on optimizations, the compiler keeps every variable, including function arguments, in memory on the stack, and reloads it each time it is used. (Small return values still come back in registers.) The reason it does this is that it makes it easier for debuggers to keep track of what's going on in the program: they can observe the program's stack.

In addition, every function sets up a frame pointer (rbp) when it is entered and restores the caller's frame pointer when it exits. This is also for the debugger's benefit: it makes it easy to walk the call stack and see which function you are in.

Code with -O0:

X::X(unsigned long, unsigned long):
    push    rbp        // Save the caller's frame pointer on the stack
    mov     rbp, rsp   // Copy the stack pointer into rbp (the new frame pointer)
    // Spill the register arguments (this, a, b) to the stack
    mov     QWORD PTR [rbp-8], rdi  
    mov     QWORD PTR [rbp-16], rsi
    mov     QWORD PTR [rbp-24], rdx
    mov     rax, QWORD PTR [rbp-8]
    mov     rdx, QWORD PTR [rbp-16]
    mov     QWORD PTR [rax], rdx
    mov     rax, QWORD PTR [rbp-8]
    mov     rdx, QWORD PTR [rbp-24]
    mov     QWORD PTR [rax+8], rdx
    nop     // Does nothing; I don't know why gcc emits it here
    // Restore the caller's frame pointer
    pop     rbp
    ret

Code with -O1:

X::X(unsigned long, unsigned long):
    mov     QWORD PTR [rdi], rsi     // this->a_ = a
    mov     QWORD PTR [rdi+8], rdx   // this->b_ = b
    ret

Does this matter?

Kind of. Code without optimizations is a lot slower, precisely because of all this extra storing and reloading. Outside of a debugging session, there's pretty much no reason not to enable optimization.

How to debug optimized code

Both gcc and clang have the -Og option: it turns on the optimizations that don't interfere with debugging. If the debug version of the code is running slowly, try compiling it with -Og.

Code with -Og:

X::X(unsigned long, unsigned long):
    mov     QWORD PTR [rdi], rsi     // this->a_ = a
    mov     QWORD PTR [rdi+8], rdx   // this->b_ = b
    ret

Resources

More information on -Og and other options to make code easy to debug: https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html

More information on optimization and optimization options: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

Answer 3

As others have commented, the compiler is under no obligation to optimise your code when you don't ask it to, but a lot of the inefficiency stems from:

the compiler spilling parameters passed in registers to a holding area on the stack on entry to the function (and then using the copies on the stack thereafter)
the fact that x86 has no memory-to-memory MOV instruction, so every copy has to go through a register

These two factors combine to give you the code you see in the disassembly (although clang clearly makes a better job of things than gcc here).

The compiler spills those registers to the stack to make debugging easier - because they are on the stack, the parameters passed into the function remain available throughout the function and this can be very helpful when debugging. Also, you can play tricks like patching in new values for aforesaid parameters at a breakpoint before continuing execution, when you realise what their values should actually be and want to then continue your debugging session.

I'm not sure why both compilers are zeroing a_ and b_ before assigning to them in your disassembly. I don't see this over at Godbolt.
