
Why doesn't gcc zero the upper values of an XMM register when only using the lower value with SS/SD instructions?

For example with such function,

int fb(char a, char b, char c, char d) {
    return (a + b) - (c + d);
}

gcc's assembly output is:

fb:
        movsx   esi, sil
        movsx   edi, dil
        movsx   ecx, cl
        movsx   edx, dl
        add     edi, esi
        add     edx, ecx
        mov     eax, edi
        sub     eax, edx
        ret

Vaguely, I understand that the purpose of movsx is to remove the dependency on the previous value of the register, but honestly I still don't understand exactly what kind of dependency it is trying to remove. For example, with or without movsx esi, sil: if some value is being written to esi, any operation using esi has to wait; if a value is being read from esi, any operation modifying esi has to wait; and if esi isn't being used by any operation, the code continues to run. What difference does movsx make? I can't say the compiler is doing something wrong, because movsx or movzx is (almost?) always produced by any compiler when loading values narrower than 32 bits.

Apart from my lack of understanding, gcc behaves differently with floats.

float ff(float a, float b, float c, float d) {
    return (a + b) - (c + d);
}

is compiled to,

ff:
        addss   xmm0, xmm1
        addss   xmm2, xmm3
        subss   xmm0, xmm2
        ret

If the same logic was applied, I believe the output should be something like,

ff:
        movd    xmm0, xmm0
        movd    xmm1, xmm1
        movd    xmm2, xmm2
        movd    xmm3, xmm3
        addss   xmm0, xmm1
        addss   xmm2, xmm3
        subss   xmm0, xmm2
        ret

So I'm actually asking 2 questions.

  1. Why does gcc behave differently with floats?
  2. What difference does movsx make?
  1. The return value is the same width as the args so no extension is needed. The parts of registers outside the type widths are allowed to hold garbage in x86 and x86-64 calling conventions. (This applies to both GP integer and vector registers.)

    The exception is an undocumented extension to the calling convention that clang depends on: callers extend narrow args to 32-bit, so clang will skip the movsx instructions in your char example. https://godbolt.org/z/Gv5e4h3Eh

    The Q&A "Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI?" covers both the high garbage and the unofficial extension to the calling convention.

    Since you asked about false dependencies, note that compilers do use movaps xmm,xmm to copy a scalar. (E.g. in GCC's missed optimizations in (a-b) + (a-d), we need to subtract from a twice. Subtraction is non-commutative, so we need a copy: https://godbolt.org/z/Tvx19raa3 )

  2. C integer promotion rules mean that a+b for narrow inputs is equivalent to (int)a + (int)b. In all x86 / x86-64 ABIs, char is a signed type (unlike on ARM, for example), so it needs to be sign-extended to int width, not zero-extended. And definitely not truncated.

    If you truncated the result again by returning a char, compilers could, if they wanted, just do 8-bit adds. But in practice they use 32-bit adds and leave whatever high garbage is there: https://godbolt.org/z/hGdbecPqv . So the movsx is not there for dep-breaking / performance, just correctness.

    As far as performance, GCC's behaviour of reading the 32-bit reg for a char is good if the caller wrote the full register (which the unofficial extension to the calling convention requires anyway), or on CPUs that don't rename the low 8 bits separately from the rest of the reg. That is everything other than P6-family: SnB-family only renames high-8 regs, except for original Sandybridge itself. See "Why doesn't GCC use partial registers?"
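To make the correctness argument in point 2 concrete, here is a minimal C sketch. The function names fb_explicit and fb_zero_extended are hypothetical, and signed char is used instead of plain char so that the behaviour matches x86's signed char on any target (an assumption made for portability of the example).

```c
/* Same computation as fb() in the question, with the integer promotions
   written out. The compiler performs these promotions implicitly, which
   is why it emits movsx for each narrow argument. */
int fb_explicit(signed char a, signed char b, signed char c, signed char d) {
    return ((int)a + (int)b) - ((int)c + (int)d);
}

/* For contrast only: what the result would be if the arguments were
   zero-extended (movzx) instead of sign-extended. Negative inputs
   become values in 128..255, so the result differs. */
int fb_zero_extended(signed char a, signed char b, signed char c, signed char d) {
    return ((int)(unsigned char)a + (int)(unsigned char)b)
         - ((int)(unsigned char)c + (int)(unsigned char)d);
}
```

Calling fb_explicit(100, 100, 0, 0) yields 200, which does not fit in 8 bits, so the adds must be done at int width. Calling both functions with (-1, -1, 0, 0) gives -2 for the sign-extended version and 510 for the zero-extended one, which is why movzx would be wrong here.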


PS: there's no such instruction as movd xmm0, xmm0, only a different form of movq xmm0, xmm0, which yes, would zero-extend the low 64 bits of an XMM register into the full reg.
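As a rough illustration of that movq xmm, xmm behaviour from C, the SSE2 intrinsic _mm_move_epi64 keeps the low 64 bits and zeroes the upper 64. The helper name low_qword_lanes is hypothetical, and the exact instruction a compiler picks for it may vary.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_move_epi64, _mm_set_epi32 */

/* Build a vector with 32-bit lanes {1, 2, 3, 4} (lane 0 first), keep only
   the low 64 bits, and store the four resulting lanes into out[]. */
void low_qword_lanes(int out[4]) {
    __m128i v = _mm_set_epi32(4, 3, 2, 1);   /* arguments are high lane first */
    __m128i lo = _mm_move_epi64(v);          /* upper 64 bits become zero */
    _mm_storeu_si128((__m128i *)out, lo);
}
```

After the call, out holds {1, 2, 0, 0}: the low qword survives and the upper qword is cleared.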

If you want to see various compiler attempts to zero-extend the low dword, with/without SSE4.1 insertps, look at the asm for __m128 foo(float f) { return _mm_set_ss(f); } in the Godbolt link above. E.g. with just SSE2, zero a register with pxor, then movss xmm1, xmm0. Otherwise, insertps or xor-zero and blendps.
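A small sketch for checking what _mm_set_ss produces, assuming an x86 target with SSE. foo is the function from the text; set_ss_lanes is a hypothetical helper added only to inspect the lanes.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set_ss, _mm_storeu_ps */

/* The function from the text: f in the low element, upper elements zeroed. */
__m128 foo(float f) { return _mm_set_ss(f); }

/* Copy the four float lanes of foo(f) into out[] so they can be examined. */
void set_ss_lanes(float f, float out[4]) {
    _mm_storeu_ps(out, foo(f));
}
```

Unlike the plain float functions in the question, here the upper three lanes are required to be zero, so the compiler has to emit an explicit zero-extension of the low dword.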
