
GCC won't use its own optimization trick without -fwrapv

Consider this C++ code:

#include <cstdint>

// returns a if less than b or if b is INT32_MIN
int32_t special_min(int32_t a, int32_t b)
{
    return a < b || b == INT32_MIN ? a : b;
}

GCC with -fwrapv correctly realizes that subtracting 1 from b can eliminate the special case (with wrapping arithmetic, b - 1 wraps from INT32_MIN to INT32_MAX, so a <= b - 1 is always true in that case), and it generates this code for x86-64:

    lea     edx, [rsi-1]   # edx = b - 1 (wraps to INT32_MAX when b == INT32_MIN)
    mov     eax, edi       # retval = a
    cmp     edi, edx
    cmovg   eax, esi       # retval = b if a > b-1 (signed compare)
    ret

But without -fwrapv it generates worse code:

    mov     eax, esi            # retval = b
    cmp     edi, esi
    jl      .L4                 # return a if a < b
    cmp     esi, -2147483648
    je      .L4                 # return a if b == INT32_MIN
    ret
.L4:
    mov     eax, edi            # retval = a
    ret

I understand that -fwrapv is needed if I write C++ code which relies on signed overflow. But:

  1. The above C++ code does not depend on signed overflow (it is valid standard C++).
  2. We all know that signed overflow has a specific behavior on x86-64.
  3. The compiler knows it is compiling for x86-64.

If I wrote "hand optimized" C++ code trying to implement that optimization myself (see the sketch below), I understand -fwrapv would be required; otherwise the compiler could decide the signed overflow is UB and do whatever it wants in the case where b == INT32_MIN. But here the compiler is in control, and I don't see what stops it from using the optimization without -fwrapv. Is there some reason it isn't allowed to?
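For reference, the hand-optimized version I have in mind would look something like this (a sketch; the name special_min_handopt is mine, and the b - 1 is exactly the overflow that makes -fwrapv necessary):

// Hand-written version of GCC's trick. When b == INT32_MIN, b - 1
// overflows: UB in standard C++, but it wraps to INT32_MAX under
// -fwrapv, making a <= b - 1 always true, so the special case folds
// into the comparison.
int32_t special_min_handopt(int32_t a, int32_t b)
{
    return a <= b - 1 ? a : b;
}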

This kind of missed optimization has happened before in GCC, like not fully treating signed int add as associative even though it's compiling for a 2's complement target with wrapping addition. So it optimizes better for unsigned. IIRC, the reason was something like GCC losing track of some information it had about the operations, and thus being conservative? I forget if that ever got fixed.

I can't find where I've seen this before on SO with a reply from a GCC dev about the internals; maybe it was in a GCC bug report? I think it was with something like a+b+c+d+e (not) re-associating into a tree of dependencies to shorten the critical path. But unfortunately it's still present in current GCC:

int sum(int a, int b, int c, int d, int e, int f) {
    return a+b+c+d+e+f;
    // gcc and clang make one stupid dep chain
}

int sumv2(int a, int b, int c, int d, int e, int f) {
    return (a+b)+(c+d)+(e+f);
    // clang pessimizes this back to 1 chain, GCC doesn't
}

unsigned sumu(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
    return a+b+c+d+e+f;
    // gcc and clang make one stupid dep chain
}

unsigned sumuv2(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
    return (a+b)+(c+d)+(e+f);
    // GCC and clang pessimize back to 1 chain for unsigned
}

On Godbolt for x86-64 System V at -O3, clang and gcc -fwrapv make the same asm for all 4 functions, as you'd expect.

GCC (without -fwrapv) makes the same asm for sumu as for sumuv2 (summing into r8d, the reg that held e). But GCC makes different asm for sum and sumv2, because they use signed int:

# gcc -O3 *without* -fwrapv
# The same order of operations as the C source
sum(int, int, int, int, int, int):
        add     edi, esi     # a += b
        add     edi, edx     # ((a+b) + c) ...
        add     edi, ecx     # sum everything into EDI
        add     edi, r8d
        lea     eax, [rdi+r9]
        ret

# also as written, the source order of operations:
sumv2(int, int, int, int, int, int):
        add     edi, esi    # a+=b
        add     edx, ecx    # c+=d
        add     r8d, r9d    # e+=f
        add     edi, edx       # (a+b) += (c+d)
        lea     eax, [rdi+r8]  # retval = (a+b+c+d) + (e+f)
        ret

So ironically GCC makes better asm when it doesn't re-associate the source. That's assuming that all 6 inputs are ready at once. If out-of-order exec of earlier code only produced the input registers 1 per cycle, the final result here would be ready only 1 cycle after the final input was ready, assuming that final input was f .

But if the last input to arrive was a or b, the result wouldn't be ready until 5 cycles later with the single chain that GCC and clang use when they can, vs. 3 cycles worst case for the tree reduction, and 2 cycles best case (if e or f were ready last).
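To make those cycle counts concrete (a sketch assuming 1-cycle latency per scalar integer add, counting only the adds that depend on the last input to arrive):

# single chain: ((((a+b)+c)+d)+e)+f
#   a or b ready last  -> all 5 adds are dependent -> 5 cycles
# tree:          ((a+b)+(c+d))+(e+f)
#   a, b, c or d last  -> 3 dependent adds         -> 3 cycles
#   e or f last        -> 2 dependent adds         -> 2 cycles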

(Update: -mtune=znver2 makes GCC re-associate into a tree, thanks @amonakov. So this is a tuning choice with a default that seems strange to me, at least for this specific problem-size. See the GCC source, searching for reassoc, to see the costs for other tuning settings; most of them are 1,1,1,1, which is insane, especially for floating-point. This might be why GCC fails to use multiple vector accumulators when unrolling FP loops, defeating the purpose.)

But anyway, this is a case of GCC only re-associating signed int with -fwrapv. So clearly it limits itself more than necessary without -fwrapv.


Related: Compiler optimizations may cause integer overflow. Is that okay? - it is of course legal, and failure to do it is a missed optimization.

GCC isn't totally hamstrung by signed int; it will auto-vectorize int sum += arr[i], and it does manage the optimization asked about in Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? for signed int a.
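For example, a plain reduction loop like this does get auto-vectorized at -O3, even though vectorizing it re-associates the signed additions (a minimal sketch of the pattern mentioned above; sum_array is my name for it):

#include <cstddef>

// GCC auto-vectorizes this sum of signed ints at -O3, even though the
// vectorized version adds the elements in a different order than the
// source. (sum_array is a hypothetical name.)
int sum_array(const int *arr, std::size_t n)
{
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += arr[i];
    return sum;
}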
