
Why doesn't gcc zero the upper values of an XMM register when only using the lower value with SS/SD instructions?

For example, with a function like this,

int fb(char a, char b, char c, char d) {
    return (a + b) - (c + d);
}

gcc's assembly output is,

fb:
        movsx   esi, sil
        movsx   edi, dil
        movsx   ecx, cl
        movsx   edx, dl
        add     edi, esi
        add     edx, ecx
        mov     eax, edi
        sub     eax, edx
        ret

Vaguely, I understand that the purpose of movsx is to remove the dependency on the previous value of the register, but honestly I still don't understand exactly what kind of dependency it is trying to remove. I mean, for example, whether or not there is a movsx esi, sil: if some value is being written to esi, then any operation using esi will have to wait; if a value is being read from esi, any operation modifying esi will have to wait; and if esi isn't being used by any operation, the code will continue to run. What difference does movsx make? I cannot say the compiler is doing something wrong, because movsx or movzx is (almost?) always produced by any compiler whenever it loads values smaller than 32 bits.

Apart from my lack of understanding, gcc behaves differently with floats.

float ff(float a, float b, float c, float d) {
    return (a + b) - (c + d);
}

is compiled to,

ff:
        addss   xmm0, xmm1
        addss   xmm2, xmm3
        subss   xmm0, xmm2
        ret

If the same logic were applied, I believe the output should be something like,

ff:
        movd    xmm0, xmm0
        movd    xmm1, xmm1
        movd    xmm2, xmm2
        movd    xmm3, xmm3
        addss   xmm0, xmm1
        addss   xmm2, xmm3
        subss   xmm0, xmm2
        ret

So I'm actually asking two questions.

  1. Why does gcc behave differently with floats?
  2. What difference does movsx make?

  1. The return value is the same width as the args, so no extension is needed. The parts of registers outside the type widths are allowed to hold garbage in x86 and x86-64 calling conventions. (This applies to both GP integer and vector registers.)

    The exception is an undocumented extension to the calling convention, which clang depends on, where callers extend narrow args to 32 bits; clang therefore skips the movsx instructions in your char example: https://godbolt.org/z/Gv5e4h3Eh

    The Q&A "Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI?" covers both the high garbage and the unofficial extension to the calling convention.

    Since you asked about false dependencies, note that compilers do use movaps xmm,xmm to copy a scalar. (E.g. in GCC's missed optimizations in (a-b) + (a-d), we need to subtract from a twice. Subtraction is non-commutative, so we need a copy: https://godbolt.org/z/Tvx19raa3 )

  2. C integer promotion rules mean that a+b for narrow inputs is equivalent to (int)a + (int)b. In all x86 / x86-64 ABIs, char is a signed type (unlike on ARM, for example), so it needs to be sign-extended to int width, not zero-extended. And definitely not truncated.

    If you truncated the result again by returning a char, compilers could, if they wanted, just do 8-bit adds. But actually they'll use 32-bit adds and leave whatever high garbage is there: https://godbolt.org/z/hGdbecPqv . The movsx in your example isn't there for dep-breaking or performance, just correctness.

    As far as performance goes, GCC's behaviour of reading the 32-bit reg for a char is good if the caller wrote the full register (which the unofficial extension to the calling convention requires anyway), or on CPUs that don't rename low-8 separately from the rest of the reg (everything other than P6-family: SnB-family only renames high-8 regs, except for original Sandybridge itself. See: Why doesn't GCC use partial registers?)
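To make the promotion point in item 2 concrete, here is a minimal sketch in C. The names fb_promoted, fb_low8 and check_promotion are hypothetical (they are not from the question), and signed char is used instead of plain char so the result doesn't depend on the ABI's choice of signedness.

```c
#include <assert.h>

/* Same expression as fb() in the question. Each signed char operand is
   promoted to int before the additions and the subtraction happen. */
int fb_promoted(signed char a, signed char b, signed char c, signed char d) {
    return (a + b) - (c + d);
}

/* Hypothetical variant that truncates the int result back to 8 bits.
   Only the low 8 bits of the arithmetic matter here, so high garbage in
   the inputs could not affect the returned value. */
unsigned char fb_low8(signed char a, signed char b, signed char c, signed char d) {
    return (unsigned char)((a + b) - (c + d));
}

/* Small self-check; call it from your own main() to run it. */
void check_promotion(void) {
    /* (100 + 100) - (-100 + -100) = 400, which does not fit in 8 bits,
       so the inputs must have been widened to int correctly. */
    assert(fb_promoted(100, 100, -100, -100) == 400);
    /* Truncating 400 to 8 bits keeps 400 - 256 = 144. */
    assert(fb_low8(100, 100, -100, -100) == 144);
}
```

With an int return type the upper bits of each widened input are part of the result, which is why the sign extension is needed for correctness; in the truncating variant they are not.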


PS: there's no such instruction as movd xmm0, xmm0, only a different form of movq xmm0, xmm0, which, yes, would zero-extend the low 64 bits of an XMM register into the full reg.

If you want to see various compiler attempts to zero-extend the low dword, with/without SSE4.1 insertps, look at the asm for __m128 foo(float f) { return _mm_set_ss(f); } in the Godbolt link above. E.g. with just SSE2, zero a register with pxor, then movss xmm1, xmm0. Otherwise, insertps, or xor-zero and blendps.
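As a rough illustration of the zero-extension described above, here is a sketch using SSE intrinsics from <xmmintrin.h>. It assumes an x86 or x86-64 target with SSE enabled; the function names zext_set_ss, zext_move_ss and lanes are hypothetical helpers, not anything from the Godbolt link.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only */

/* Build {f, 0, 0, 0}: the low float with the upper three lanes zeroed. */
__m128 zext_set_ss(float f) {
    return _mm_set_ss(f);
}

/* The "pxor, then movss" idea expressed with intrinsics: start from an
   all-zero vector and merge in only the low element of v. */
__m128 zext_move_ss(__m128 v) {
    return _mm_move_ss(_mm_setzero_ps(), v);
}

/* Helper: copy the four lanes to memory so they can be inspected. */
void lanes(__m128 v, float out[4]) {
    _mm_storeu_ps(out, v);
}
```

Which instructions the compiler actually picks for these depends on the compiler version and the enabled ISA extensions, so compare the asm yourself rather than relying on this sketch.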

