Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)float type punning)

Question

I'm trying to benchmark the fast inverse square root. The full code is here:

#include <benchmark/benchmark.h>
#include <math.h>

float number = 30942;
    
static void BM_FastInverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        // from wikipedia:
        long i;
        float x2, y;
        const float threehalfs = 1.5F;

        x2 = number * 0.5F;
        y  = number;
        i  = * ( long * ) &y;
        i  = 0x5f3759df - ( i >> 1 );
        y  = * ( float * ) &i;
        y  = y * ( threehalfs - ( x2 * y * y ) );
        //  y  = y * ( threehalfs - ( x2 * y * y ) );
        
        float result = y;
        benchmark::DoNotOptimize(result);
    }
}


static void BM_InverseSqrRoot(benchmark::State &state) {
    for (auto _ : state) {
        float result = 1 / sqrt(number);
        benchmark::DoNotOptimize(result);
    } 
}

BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);

Here is the code in quick-bench if you want to run it yourself.

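Compiled with GCC 11.2 and -O3, BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiled with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). That is a 10x speed difference.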

Here is the relevant assembly (taken from quick-bench).

With GCC:

               push   %rbp
               mov    %rdi,%rbp
               push   %rbx
               sub    $0x18,%rsp
               cmpb   $0x0,0x1a(%rdi)
               je     408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
               callq  40a770 <benchmark::State::StartKeepRunning()>
  408c84       add    $0x18,%rsp
               mov    %rbp,%rdi
               pop    %rbx
               pop    %rbp
               jmpq   40aa20 <benchmark::State::FinishKeepRunning()>
               nopw   0x0(%rax,%rax,1)
  408c98       mov    0x10(%rdi),%rbx
               callq  40a770 <benchmark::State::StartKeepRunning()>
               test   %rbx,%rbx
               je     408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
               movss  0x1b386(%rip),%xmm4        # 424034 <_IO_stdin_used+0x34>
               movss  0x1b382(%rip),%xmm3        # 424038 <_IO_stdin_used+0x38>
               mov    $0x5f3759df,%edx
               nopl   0x0(%rax,%rax,1)
   408cc0      movss  0x237a8(%rip),%xmm0        # 42c470 <number>
               mov    %edx,%ecx
               movaps %xmm3,%xmm1
        2.91%  movss  %xmm0,0xc(%rsp)
               mulss  %xmm4,%xmm0
               mov    0xc(%rsp),%rax
        44.70% sar    %rax
        3.27%  sub    %eax,%ecx
        3.24%  movd   %ecx,%xmm2
        3.27%  mulss  %xmm2,%xmm0
        9.58%  mulss  %xmm2,%xmm0
        10.00% subss  %xmm0,%xmm1
        10.03% mulss  %xmm2,%xmm1
        9.64%  movss  %xmm1,0x8(%rsp)
        3.33%  sub    $0x1,%rbx
               jne    408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
               add    $0x18,%rsp
               mov    %rbp,%rdi
               pop    %rbx
               pop    %rbp
  408d0a       jmpq   40aa20 <benchmark::State::FinishKeepRunning()>

With Clang:

           push   %rbp
           push   %r14
           push   %rbx
           sub    $0x10,%rsp
           mov    %rdi,%r14
           mov    0x1a(%rdi),%bpl
           mov    0x10(%rdi),%rbx
           call   213a80 <benchmark::State::StartKeepRunning()>
           test   %bpl,%bpl
           jne    212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
           test   %rbx,%rbx
           je     212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
           movss  -0xf12e(%rip),%xmm0        # 203cec <_IO_stdin_used+0x8>
           movss  -0xf13a(%rip),%xmm1        # 203ce8 <_IO_stdin_used+0x4>
           cs nopw 0x0(%rax,%rax,1)
           nopl   0x0(%rax)
 212e30 2.46%  movd   0x3c308(%rip),%xmm2        # 24f140 <number>
        4.83%  movd   %xmm2,%eax
        8.07%  mulss  %xmm0,%xmm2
        12.35% shr    %eax
        2.60%  mov    $0x5f3759df,%ecx
        5.15%  sub    %eax,%ecx
        8.02%  movd   %ecx,%xmm3
        11.53% mulss  %xmm3,%xmm2
        3.16%  mulss  %xmm3,%xmm2
        5.71%  addss  %xmm1,%xmm2
        8.19%  mulss  %xmm3,%xmm2
        16.44% movss  %xmm2,0xc(%rsp)
        11.50% add    $0xffffffffffffffff,%rbx
               jne    212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
 212e69        mov    %r14,%rdi
               call   213af0 <benchmark::State::FinishKeepRunning()>
               add    $0x10,%rsp
               pop    %rbx
               pop    %r14
               pop    %rbp
  212e79       ret

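They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking 46%? (But I think it's just mislabelled, and it's the mulss, mov, sar that together take 46%.) Anyway, I'm not familiar enough with assembly to really tell what is causing such a huge performance difference.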

Does anyone know?

Answer 1

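Just FYI, see "Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?" The answer is no: it has been obsoleted by SSE1 rsqrtss, which you can use with or without a Newton iteration.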

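As people pointed out in comments, you're using a 64-bit long (since this is x86-64 on a non-Windows system) and pointing it at a 32-bit float. So besides being a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.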

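Your perf report output confirms it: GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store-forwarding stall costing a lot of extra latency. It also costs throughput, and throughput is what you're actually testing, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber, which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)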

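The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to pick one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just the one after a slow instruction even without a data dependency, I forget.)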

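Clang chooses to assume that the upper 32 bits are zero, so it uses movd %xmm2,%eax to just copy the register with an ALU uop, and shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi, so this isn't Windows clang.)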

Fixed version: GCC and clang make similar asm

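Fixing the code from the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows that GCC and clang compile it similarly with -Ofast, although not identically. For example, GCC loads number twice, once into an integer register and once into XMM0. Clang loads it once and uses movd eax, xmm2 to get the integer copy.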

On QB ( https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8 ), GCC's BM_FastInverseSqrRoot is now faster than the naive version by a factor of 2, without -ffast-math.

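And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ inferring sqrtf from sqrt(float). It does check for the number being >=0 every time, since quick-bench doesn't allow compiling with -fno-math-errno to omit that check, which guards a possible call to the libm function. But that branch predicts perfectly, so the loop should still easily bottleneck on port 0 throughput (the div/sqrt unit).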
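The overload resolution mentioned above can be checked at compile time. This is a small sketch; naive_inv_sqrt is a hypothetical name, not something from the benchmark.

```cpp
#include <cmath>
#include <type_traits>

// For a float argument, C++ selects the float overload of sqrt, so the
// compiler can emit single-precision sqrtss rather than converting to double.
static_assert(std::is_same<decltype(std::sqrt(1.0F)), float>::value,
              "std::sqrt(float) returns float");

// Hypothetical helper mirroring the naive benchmark body.
inline float naive_inv_sqrt(float x) {
    return 1.0F / std::sqrt(x);   // single-precision sqrt followed by a divide
}
```

In C, by contrast, sqrt always takes and returns double, so the same expression would convert to double and back.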

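Quick-bench does allow -Ofast, which is equivalent to -O3 -ffast-math, and with it the compilers use rsqrtss and a Newton iteration. (It would be even faster with FMA, but quick-bench doesn't allow -march=native or anything like it. I guess one could use __attribute__((target("avx,fma"))).)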
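For reference, here is a rough sketch of writing the rsqrtss-plus-Newton approach by hand with SSE intrinsics rather than relying on -Ofast. The function name rsqrt_newton and the HAVE_SSE_RSQRT macro are made up for illustration, and the non-SSE fallback exists only so the sketch builds on other targets. It is not necessarily the exact sequence the compilers emit.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1   // hypothetical macro for this sketch only
#endif

// Hypothetical helper: hardware reciprocal-sqrt estimate refined by one
// Newton iteration, y' = y * (1.5 - 0.5 * x * y * y).
inline float rsqrt_newton(float x) {
#ifdef HAVE_SSE_RSQRT
    // rsqrtss gives an approximation good to roughly 12 bits.
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    // Fallback for non-SSE targets (assumption, only so the sketch compiles).
    float y = 1.0F / std::sqrt(x);
#endif
    return y * (1.5F - 0.5F * x * y * y);
}
```

One Newton step roughly doubles the number of correct bits, which is where the near-full float precision comes from.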

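Quick-bench is now giving "Error or timeout" whether I use that or not, with "Permission error mapping pages." and a suggestion to use a smaller -m/--mmap_pages, so I can't test on that system.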

rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster than or similar to Quake's fast invsqrt, but with about 23 bits of precision.
