Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)float type punning)
I'm trying to benchmark the fast inverse square root. The full code is here:
#include <benchmark/benchmark.h>
#include <math.h>

float number = 30942;

static void BM_FastInverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    // from wikipedia:
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = * ( long * ) &y;
    i = 0x5f3759df - ( i >> 1 );
    y = * ( float * ) &i;
    y = y * ( threehalfs - ( x2 * y * y ) );
    // y = y * ( threehalfs - ( x2 * y * y ) );

    float result = y;
    benchmark::DoNotOptimize(result);
  }
}

static void BM_InverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    float result = 1 / sqrt(number);
    benchmark::DoNotOptimize(result);
  }
}

BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);
and here is the code in quick-bench if you want to run it yourself.
Compiling with GCC 11.2 and -O3, BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). That's a 10x speed difference.
Here is the relevant assembly (taken from quick-bench).
With GCC:
push %rbp
mov %rdi,%rbp
push %rbx
sub $0x18,%rsp
cmpb $0x0,0x1a(%rdi)
je 408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
callq 40a770 <benchmark::State::StartKeepRunning()>
408c84 add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
nopw 0x0(%rax,%rax,1)
408c98 mov 0x10(%rdi),%rbx
callq 40a770 <benchmark::State::StartKeepRunning()>
test %rbx,%rbx
je 408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
movss 0x1b386(%rip),%xmm4 # 424034 <_IO_stdin_used+0x34>
movss 0x1b382(%rip),%xmm3 # 424038 <_IO_stdin_used+0x38>
mov $0x5f3759df,%edx
nopl 0x0(%rax,%rax,1)
408cc0 movss 0x237a8(%rip),%xmm0 # 42c470 <number>
mov %edx,%ecx
movaps %xmm3,%xmm1
2.91% movss %xmm0,0xc(%rsp)
mulss %xmm4,%xmm0
mov 0xc(%rsp),%rax
44.70% sar %rax
3.27% sub %eax,%ecx
3.24% movd %ecx,%xmm2
3.27% mulss %xmm2,%xmm0
9.58% mulss %xmm2,%xmm0
10.00% subss %xmm0,%xmm1
10.03% mulss %xmm2,%xmm1
9.64% movss %xmm1,0x8(%rsp)
3.33% sub $0x1,%rbx
jne 408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
408d0a jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
With Clang:
push %rbp
push %r14
push %rbx
sub $0x10,%rsp
mov %rdi,%r14
mov 0x1a(%rdi),%bpl
mov 0x10(%rdi),%rbx
call 213a80 <benchmark::State::StartKeepRunning()>
test %bpl,%bpl
jne 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
test %rbx,%rbx
je 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
movss -0xf12e(%rip),%xmm0 # 203cec <_IO_stdin_used+0x8>
movss -0xf13a(%rip),%xmm1 # 203ce8 <_IO_stdin_used+0x4>
cs nopw 0x0(%rax,%rax,1)
nopl 0x0(%rax)
212e30 2.46% movd 0x3c308(%rip),%xmm2 # 24f140 <number>
4.83% movd %xmm2,%eax
8.07% mulss %xmm0,%xmm2
12.35% shr %eax
2.60% mov $0x5f3759df,%ecx
5.15% sub %eax,%ecx
8.02% movd %ecx,%xmm3
11.53% mulss %xmm3,%xmm2
3.16% mulss %xmm3,%xmm2
5.71% addss %xmm1,%xmm2
8.19% mulss %xmm3,%xmm2
16.44% movss %xmm2,0xc(%rsp)
11.50% add $0xffffffffffffffff,%rbx
jne 212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
212e69 mov %r14,%rdi
call 213af0 <benchmark::State::FinishKeepRunning()>
add $0x10,%rsp
pop %rbx
pop %r14
pop %rbp
212e79 ret
They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking 46%? (But I think it's just mislabelled and it's the mulss, mov, sar that together take 46%.) Anyway, I'm not familiar enough with assembly to really tell what is causing such a huge performance difference.
Anyone know?
Just FYI: Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64? - no, it's obsoleted by SSE1 rsqrtss, which you can use with or without a Newton iteration.
As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as being a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store-forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)
The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)
Clang chooses to assume that the upper 32 bits are zero, thus movd %xmm0, %eax to just copy the register with an ALU uop, and shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi so that isn't Windows clang.)
Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compile similarly with -Ofast, although not identically. e.g. GCC loads number twice, once into an integer register, once into XMM0; Clang loads it once and uses movd eax, xmm2 to get it.
On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), GCC's BM_FastInverseSqrRoot is now faster by a factor of 2 than the naive version, without -ffast-math.
And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ inferring sqrtf from sqrt(float). It does check for the number being >= 0 every time, since quick-bench doesn't allow compiling with -fno-math-errno to omit that check to maybe call the libm function. But that branch predicts perfectly so the loop should still easily just bottleneck on port 0 throughput (the div/sqrt unit).
Quick-bench does allow -Ofast, which is equivalent to -O3 -ffast-math, which uses rsqrtss and a Newton iteration. (It would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything. I guess one could use __attribute__((target("avx,fma"))).)
Quick-bench is now giving Error or timeout whether I use that or not, with "Permission error mapping pages" and a suggestion to use a smaller -m/--mmap_pages, so I can't test on that system.
rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster than or similar to Quake's fast invsqrt, but with about 23 bits of precision.