Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)float type punning)
I'm trying to benchmark the fast inverse square root. The full code is here:
#include <benchmark/benchmark.h>
#include <math.h>

float number = 30942;

static void BM_FastInverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    // from wikipedia:
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = * ( long * ) &y;
    i = 0x5f3759df - ( i >> 1 );
    y = * ( float * ) &i;
    y = y * ( threehalfs - ( x2 * y * y ) );
    // y = y * ( threehalfs - ( x2 * y * y ) );

    float result = y;
    benchmark::DoNotOptimize(result);
  }
}

static void BM_InverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    float result = 1 / sqrt(number);
    benchmark::DoNotOptimize(result);
  }
}

BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);
and here is the code in quick-bench if you want to run it yourself.
Compiling with GCC 11.2 and -O3, BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). That's a 10x speed difference.
Here is the relevant assembly (taken from quick-bench).
With GCC:
push %rbp
mov %rdi,%rbp
push %rbx
sub $0x18,%rsp
cmpb $0x0,0x1a(%rdi)
je 408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
callq 40a770 <benchmark::State::StartKeepRunning()>
408c84 add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
nopw 0x0(%rax,%rax,1)
408c98 mov 0x10(%rdi),%rbx
callq 40a770 <benchmark::State::StartKeepRunning()>
test %rbx,%rbx
je 408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
movss 0x1b386(%rip),%xmm4 # 424034 <_IO_stdin_used+0x34>
movss 0x1b382(%rip),%xmm3 # 424038 <_IO_stdin_used+0x38>
mov $0x5f3759df,%edx
nopl 0x0(%rax,%rax,1)
408cc0 movss 0x237a8(%rip),%xmm0 # 42c470 <number>
mov %edx,%ecx
movaps %xmm3,%xmm1
2.91% movss %xmm0,0xc(%rsp)
mulss %xmm4,%xmm0
mov 0xc(%rsp),%rax
44.70% sar %rax
3.27% sub %eax,%ecx
3.24% movd %ecx,%xmm2
3.27% mulss %xmm2,%xmm0
9.58% mulss %xmm2,%xmm0
10.00% subss %xmm0,%xmm1
10.03% mulss %xmm2,%xmm1
9.64% movss %xmm1,0x8(%rsp)
3.33% sub $0x1,%rbx
jne 408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
408d0a jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
With Clang:
push %rbp
push %r14
push %rbx
sub $0x10,%rsp
mov %rdi,%r14
mov 0x1a(%rdi),%bpl
mov 0x10(%rdi),%rbx
call 213a80 <benchmark::State::StartKeepRunning()>
test %bpl,%bpl
jne 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
test %rbx,%rbx
je 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
movss -0xf12e(%rip),%xmm0 # 203cec <_IO_stdin_used+0x8>
movss -0xf13a(%rip),%xmm1 # 203ce8 <_IO_stdin_used+0x4>
cs nopw 0x0(%rax,%rax,1)
nopl 0x0(%rax)
212e30 2.46% movd 0x3c308(%rip),%xmm2 # 24f140 <number>
4.83% movd %xmm2,%eax
8.07% mulss %xmm0,%xmm2
12.35% shr %eax
2.60% mov $0x5f3759df,%ecx
5.15% sub %eax,%ecx
8.02% movd %ecx,%xmm3
11.53% mulss %xmm3,%xmm2
3.16% mulss %xmm3,%xmm2
5.71% addss %xmm1,%xmm2
8.19% mulss %xmm3,%xmm2
16.44% movss %xmm2,0xc(%rsp)
11.50% add $0xffffffffffffffff,%rbx
jne 212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
212e69 mov %r14,%rdi
call 213af0 <benchmark::State::FinishKeepRunning()>
add $0x10,%rsp
pop %rbx
pop %r14
pop %rbp
212e79 ret
They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking 46%? (But I think it's just mislabelled and it's the mulss, mov, sar that together take 46%.) Anyway, I'm not familiar enough with assembly to really tell what is causing such a huge performance difference.
Anyone know?
Just FYI: Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64? - no, it's obsoleted by SSE1 rsqrtss, which you can use with or without a Newton iteration.
As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as being a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store-forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)
The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)
Clang chooses to assume that the upper 32 bits are zero, thus movd %xmm0, %eax to just copy the register with an ALU uop, and shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi so that isn't Windows clang.)
Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compile similarly with -Ofast, although not identically. e.g. GCC loads number twice, once into an integer register, once into XMM0; Clang loads it once and uses movd eax, xmm2 to get it.
On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), GCC's BM_FastInverseSqrRoot is now faster by a factor of 2 than the naive version, without -ffast-math.
And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ inferring sqrtf from sqrt(float). It does check for the number being >= 0 every time, since quick-bench doesn't allow compiling with -fno-math-errno to omit that check to maybe call the libm function. But that branch predicts perfectly so the loop should still easily just bottleneck on port 0 throughput (the div/sqrt unit).
Quick-bench does allow -Ofast, which is equivalent to -O3 -ffast-math, which uses rsqrtss and a Newton iteration. (It would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything. I guess one could use __attribute__((target("avx,fma"))).)
Quick-bench is now giving Error or timeout whether I use that or not, with "Permission error mapping pages" and a suggestion to use a smaller -m/--mmap_pages, so I can't test on that system.
rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster than or similar to Quake's fast invsqrt, but with about 23 bits of precision.