
Why is the standard "abs" function faster than mine?

I wanted to try making my own absolute value function. I figured that the fastest way to calculate absolute value would be to simply mask out the sign bit (the most significant bit in IEEE 754). I wanted to compare its speed to the standard abs function. Here is my implementation:

// Union used for type punning
union float_uint_u
{
    float f_val;
    unsigned int ui_val;
};

// 'MASK' has all bits == 1 except the sign bit
constexpr unsigned int MASK = ~(1u << (sizeof(int) * 8 - 1));

float abs_bitwise(float value)
{
    float_uint_u ret;
    ret.f_val = value;
    ret.ui_val &= MASK;
       
    return ret.f_val;
}

For the record, I know that this sort of type punning is not standard C++. However, this is just for educational purposes, and according to the docs, this is supported in GCC.

I figured this should be the fastest way to calculate absolute value, so it should at the very least be as fast as the standard implementation. However, timing 100000000 iterations of random values, I got the following results:

Bitwise time: 5.47385 | STL time: 5.15662
Ratio: 1.06152

My abs function is about 6% slower.

Assembly output

I compiled with -O2 optimization and the -S option (assembly output) to help determine what was going on. I have extracted the relevant portions:

; 16(%rsp) is a value obtained from standard input

; standard abs:
movss   16(%rsp), %xmm0
andps   .LC5(%rip), %xmm0 ; .LC5 == 2147483647
movq    %rbp, %rdi
cvtss2sd    %xmm0, %xmm0

; my abs_bitwise:
movl    16(%rsp), %eax
movq    %rbp, %rdi
andl    $2147483647, %eax
movd    %eax, %xmm0
cvtss2sd    %xmm0, %xmm0

Observations

I'm not great at assembly, but the main thing I noticed is that the standard function operates directly on the xmm0 register. With mine, however, it first moves the value to eax (for some reason), performs the and, and then moves the result into xmm0. I'm assuming the extra mov is where the slowdown happens. I also noticed that the standard version stores the bit mask elsewhere in the program rather than as an immediate; I'm guessing that's not significant, however. The two versions also use different instructions (e.g. movl vs movss).

System info

This was compiled with g++ on Debian Linux (unstable branch). g++ --version output:

g++ (Debian 10.2.1-6) 10.2.1 20210110

If these two versions of the code both calculate absolute value the same way (via an and), why doesn't the optimizer generate the same code? Specifically, why does it feel the need to include an extra mov when it optimizes my implementation?

I got slightly different assembly. According to the x86_64 Linux ABI, a float argument is passed via xmm0. With the standard fabs, the bitwise AND operation is performed directly on this register (Intel syntax):

andps xmm0, XMMWORD PTR .LC0[rip] # .LC0 contains 0x7FFFFFFF
ret

However, in your case, the bitwise AND is performed on objects of type unsigned int. Therefore, GCC does the same, which requires moving xmm0 to eax first:

movd eax, xmm0
and  eax, 2147483647
movd xmm0, eax
ret

Live demo: https://godbolt.org/z/xj8MMo

I haven't found any way to force the GCC optimizer to perform the AND directly on xmm0 with pure C/C++ source code alone. It seems that efficient implementations need to be built upon assembler code or Intel intrinsics.

Relevant question: How to perform a bitwise operation on floating point numbers. All the proposed solutions basically result in the same outcome.

I also tried to use the copysign function, but the result was even worse; the generated machine code then contained x87 instructions.


Anyway, it is quite interesting that the Clang optimizer was clever enough to make the assembly in all 3 cases equivalent: https://godbolt.org/z/b6Khv5.

Why is the standard "abs" function faster than mine?

Because with most optimizing compilers (in particular GCC or Clang), it will use a specialized machine instruction known to the compiler.

The GCC compiler even has a builtin for abs.

Be sure to compile with gcc -O3 and perhaps -ffast-math.

You could study the assembler code: compile your example.c with gcc -Wall -O3 -ffast-math -fverbose-asm -S example.c and look inside the emitted example.s assembler file.

On Linux systems (e.g. Debian), you could study the source code of GNU libc and look inside the math.h standard header (and use g++ -O3 -C -E to get the preprocessed form).

Disclaimer: The technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please cite this site or the original source. For any questions contact: yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM