Order of assignment produces different assembly

This experiment was done using GCC 6.3. There are two functions whose only difference is the order in which we assign the i32 and i16 members of the struct. We assumed both functions would produce the same assembly, but that is not the case: the "bad" function produces more instructions. Can anyone explain why this happens?

#include <inttypes.h>

union pack {
    struct {
        int32_t i32;
        int16_t i16;
    };
    void *ptr;
};
static_assert(sizeof(pack)==8, "what?");

void *bad(const int32_t i32, const int16_t i16) { 
    pack p;
    p.i32 = i32;
    p.i16 = i16;
    return p.ptr;
}

void *good(const int32_t i32, const int16_t i16) { 
    pack p;
    p.i16 = i16;
    p.i32 = i32;
    return p.ptr;
}

...

bad(int, short):
        movzx   eax, si
        sal     rax, 32
        mov     rsi, rax
        mov     eax, edi
        or      rax, rsi
        ret
good(int, short):
        movzx   eax, si
        mov     edi, edi
        sal     rax, 32
        or      rax, rdi
        ret

The compiler flags were -O3 -fno-rtti -std=c++14

This is/was a missed optimization in GCC10.2 and earlier. It seems to already be fixed in current nightly builds of GCC, so there's no need to report a missed-optimization bug on GCC's bugzilla (https://gcc.gnu.org/bugzilla/). It looks like it first appeared as a regression from GCC4.8 to GCC4.9. (Godbolt)

# GCC11-dev nightly build
# actually *better* than "good", avoiding a mov-elimination missed opt.
bad(int, short):
        movzx   esi, si          # mov-elimination never works for 16->32 movzx
        mov     eax, edi         # mov-elimination works between different regs
        sal     rsi, 32
        or      rax, rsi
        ret

Yes, you'd generally expect C++ that implements the same logic basically the same way to compile to the same asm, as long as optimization is enabled, or at least hope so¹. And generally you can hope that there are no pointless missed optimizations that waste instructions for no apparent reason (rather than simply picking a different implementation strategy), but unfortunately that's not always true either.

Writing different parts of the same object and then reading the whole object is tricky for compilers in general, so it's not a shock to see different asm when you write different parts of the full object in a different order.

Note that there's nothing "smart" about the bad asm; it's just doing a redundant mov instruction. Having to take input in fixed registers and produce output in another specific hard register to satisfy the calling convention is something GCC's register allocator isn't amazing at: wasted-mov missed optimizations like this are more common in tiny functions than when they're part of a larger function.

If you're really curious, you could dig into the GIMPLE and RTL internal representations that GCC transforms the code through on its way to asm. (Godbolt has a GCC tree-dump pane to help with this.)
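If you want the same dumps locally, GCC's -fdump-tree-all and -fdump-rtl-all options write one dump file per pass next to the object file. A sketch of the invocation (the pack.cc filename is just a placeholder, and the numeric prefixes in the dump-file names vary by GCC version):

g++ -O3 -fno-rtti -std=c++14 -c pack.cc -fdump-tree-all -fdump-rtl-all
# writes files like pack.cc.*t.optimized (final GIMPLE)
# and pack.cc.*r.* dumps for the RTL passes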

Footnote 1: Or at least hope that, but missed-optimization bugs do happen in real life. Report them when you spot them, in case it's something that GCC or LLVM devs can easily teach the optimizer to avoid. Compilers are complex pieces of machinery with multiple passes; often a corner case for one part of the optimizer simply never used to happen until some other optimization pass changed to doing something else, exposing a poor end result for a case the author of that code wasn't thinking about when writing / tweaking it to improve some other case.


Note that there's no Undefined Behaviour here despite the complaints in comments: The GNU dialect of C and C++ defines the behaviour of union type-punning in C89 and C++, not just in C99 and later like ISO C does. Implementations are free to define the behaviour of anything that ISO C++ leaves undefined.
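As an aside (my addition, not part of the original answer): if you want the same merge without relying on the GNU dialect, std::memcpy type-punning is defined by ISO C++ as well, and optimizing compilers can usually lower it to the same shift/OR merging. A minimal sketch, where the function name portable is hypothetical:

#include <cstdint>
#include <cstring>

void *portable(const int32_t i32, const int16_t i16) {
    unsigned char bytes[sizeof(void *)] = {};          // zero-filled, so the padding bytes are 0
    std::memcpy(bytes, &i32, sizeof i32);              // bytes 0..3
    std::memcpy(bytes + sizeof i32, &i16, sizeof i16); // bytes 4..5
    void *p;
    std::memcpy(&p, bytes, sizeof p);                  // read back the whole 8-byte object
    return p;
}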

Well, technically there is a read-uninitialized, because the upper 2 bytes of the void* object haven't been written yet in pack p. But fixing it with pack p = {.ptr=0}; doesn't help. (And it doesn't change the asm; GCC happened to already zero the padding because that's convenient.)
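Spelled out, that attempted fix looks like this (the bad_init name is mine; per the paragraph above, GCC emits the same asm for it as for bad):

void *bad_init(const int32_t i32, const int16_t i16) {
    pack p = {.ptr = 0};   // zero the whole object up front, padding included
    p.i32 = i32;
    p.i16 = i16;
    return p.ptr;
}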


Also note that both versions in the question are less efficient than possible:

(The bad output from GCC4.8 or GCC11-trunk avoiding the wasted mov looks optimal for that choice of strategy.)

mov edi,edi defeats mov-elimination on both Intel and AMD, so that instruction has 1 cycle latency instead of 0, and costs a back-end µop. Picking a different register to zero-extend into would be cheaper. We could even pick RSI after reading SI, but any call-clobbered register would work.

hand_written:
    movzx  eax, si    # 16->32 can't be eliminated, only 8->32 and 32->32 mov
    shl    rax, 32
    mov    ecx, edi   # zero-extend into a different reg with 0 latency
    or     rax, rcx
    ret

Or if optimizing for code-size or throughput on Intel (low µop count, not low latency), shld is an option: 1 µop / 3c latency on Intel, but 6 µops on Zen (also 3c latency, though). (https://uops.info/ and https://agner.org/optimize/)

minimal_uops_worse_latency:  # also more uops on AMD.
    movzx  eax, si
    shl    rdi, 32              # int32 bits to the top of RDI
    shld   rax, rdi, 32         # shift the high 32 bits of RDI into RAX.
    ret

If your struct were ordered the other way, with the padding in the middle, you could do something involving mov ax, si to merge into RAX, as sketched below. That could be efficient on non-Intel CPUs, and on Haswell and later, which don't do partial-register renaming except for high-8 regs like AH.
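A sketch of that idea, assuming the hypothetical swapped layout struct { int16_t i16; int32_t i32; } (i16 at byte 0, padding at bytes 2-3, i32 at bytes 4-7):

other_layout:               # hypothetical layout with padding in the middle
    mov    eax, edi         # zero-extend i32 into RAX
    shl    rax, 32          # i32 bits to RAX[63:32], low half now zero
    mov    ax, si           # merge i16 into RAX[15:0]; no partial-reg stall on Haswell+
    ret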


Given the read-uninitialized UB, you could just compile it to literally anything, including ret or ud2. Or, slightly less aggressively, you could compile it to just leave garbage in the padding part of the struct, the last 2 bytes.

high_garbage:
    shl    rsi, 32    # leaving high garbage = incoming high half of ESI
    mov    eax, edi   # zero-extend into RAX
    or     rax, rsi
    ret

Note that an unofficial extension to the x86-64 System V ABI (which clang actually depends on) is that narrow args are sign- or zero-extended to 32 bits. So instead of zeros, the high 2 bytes of the pointer would be copies of the sign bit. (Which would actually guarantee that it's a canonical 48-bit virtual address on x86-64!)
