Why does changing 0.1f to 0 slow down performance by 10x?

Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}

run more than 10 times faster than the following bit (identical except where noted)?

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}

when compiling with Visual Studio 2010 SP1. The optimization level was -O2 with SSE2 enabled. I haven't tested with other compilers.

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating-point representation. Operations on denormalized floating-point values can be tens to hundreds of times slower than on normalized ones. This is because many processors can't handle them directly and must trap and resolve them using microcode.
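To make the term concrete, here is a minimal sketch of how to check whether a float is denormal, assuming IEEE-754 single precision. The names isSubnormal, kMinNormal and kMinSubnormal are made up for this illustration; std::fpclassify and std::numeric_limits are standard.

```cpp
#include <cmath>
#include <limits>

// Smallest positive normal float (about 1.18e-38). Anything smaller in
// magnitude, other than zero, is denormal.
constexpr float kMinNormal = std::numeric_limits<float>::min();

// Smallest positive denormal float (about 1.4e-45).
constexpr float kMinSubnormal = std::numeric_limits<float>::denorm_min();

// Hypothetical helper: true if v is a denormal (subnormal) value.
bool isSubnormal(float v) {
    return std::fpclassify(v) == FP_SUBNORMAL;
}
```

For the output shown further down, isSubnormal(6.30584e-44f) is true (a value from the run with 0), while isSubnormal(1.78814e-07f) is false (a value from the run with 0.1f).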

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1f is used.

Here's the test code compiled on x64:

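#include <cstdlib>   // system
#include <iostream>
#include <omp.h>     // omp_get_wtime; build with OpenMP enabled

using namespace std;

int main() {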

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.

To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.
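The effect of flush-to-zero can be observed directly. The sketch below is an illustration for x86 with SSE, not part of the original test code: the helper name halveMinNormal is made up, and the volatile variables are only there to stop the compiler from folding or reordering the multiplication around the mode changes.

```cpp
#include <limits>
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE

// Hypothetical helper: halve the smallest normal float with flush-to-zero
// switched on or off, then restore whatever mode was active before.
float halveMinNormal(bool flushToZero) {
    const unsigned int saved = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(flushToZero ? _MM_FLUSH_ZERO_ON
                                        : _MM_FLUSH_ZERO_OFF);

    // The exact result, 2^-127, is below the normal range.
    volatile float tiny = std::numeric_limits<float>::min();
    volatile float half = 0.5f;
    volatile float product = tiny * half;

    _MM_SET_FLUSH_ZERO_MODE(saved);
    return product;
}
```

With flushing off, the call returns a small positive denormal; with flushing on, it returns exactly zero.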

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

Using gcc and applying a diff to the generated assembly yields only this difference:

73c68,69
<   movss   LCPI1_0(%rip), %xmm1
---
>   movabsq $0, %rcx
>   cvtsi2ssq   %rcx, %xmm1
81d76
<   subss   %xmm1, %xmm0

The cvtsi2ssq one being 10 times slower indeed.

Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)

(Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)

Update

Some extra tests show that it is not necessarily the cvtsi2ssq instruction. Once eliminated (using int ai=0; float a=ai; and using a instead of 0), the speed difference remains. So @Mysticial is right, the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, when the loop suddenly takes 10 times as long.

Update 2

A small visualisation of this interesting phenomenon:

  • Column 1: a float, divided by 2 for every iteration
  • Column 2: the binary representation of this float
  • Column 3: the time taken to sum this float 1e7 times

You can clearly see the exponent (the last 9 bits, since the bits are printed least-significant first; they hold the 8 exponent bits and the sign bit) change to its lowest value when denormalization sets in. At that point, simple addition becomes 20 times slower.

0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
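The jump in the table lines up with the exponent field reaching zero. As a rough sketch, assuming float is a 32-bit IEEE-754 value, the fields can be pulled apart like this (exponentField, fractionField and isSubnormalBits are names invented for this illustration):

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Biased exponent: bits 23..30 of the bit pattern.
std::uint32_t exponentField(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);  // well-defined way to read the bits
    return (bits >> 23) & 0xFFu;
}

// Fraction (mantissa) field: the low 23 bits.
std::uint32_t fractionField(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    return bits & 0x7FFFFFu;
}

// A float is denormal when the exponent field is all zeros but the
// fraction is not (all zeros in both fields is plain zero).
bool isSubnormalBits(float v) {
    return exponentField(v) == 0 && fractionField(v) != 0;
}
```

Applied to the table, the last fast row (about 1.22e-38) still has a non-zero exponent field, while the first slow row (about 6.10e-39) has an exponent field of zero.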

An equivalent discussion about ARM can be found in the Stack Overflow question "Denormalized floating point in Objective-C?".

It's due to denormalized floating-point use. How to get rid of both it and the performance penalty? Having scoured the Internet for ways of killing denormal numbers, it seems there is no "best" way to do this yet. I have found these three methods that may work best in different environments:

  • Might not work in some GCC environments:

     // Requires #include <fenv.h>
     fesetenv(FE_DFL_DISABLE_SSE_DENORMS_ENV);

  • Might not work in some Visual Studio environments:

     // Requires #include <xmmintrin.h>
     _mm_setcsr( _mm_getcsr() | (1<<15) | (1<<6) );
     // Does both FTZ and DAZ bits. You can also use just hex value 0x8040 to do both.
     // You might also want to use the underflow mask (1<<11)

  • Appears to work in both GCC and Visual Studio:

     // Requires #include <xmmintrin.h>
     // Requires #include <pmmintrin.h>
     _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
     _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

  • The Intel compiler has options to disable denormals by default on modern Intel CPUs. More details here

  • Compiler switches. -ffast-math, -msse or -mfpmath=sse will disable denormals and make a few other things faster, but unfortunately also do lots of other approximations that might break your code. Test carefully! The equivalent of fast-math for the Visual Studio compiler is /fp:fast but I haven't been able to confirm whether this also disables denormals.

In gcc you can enable FTZ and DAZ with this:

#include <xmmintrin.h>

#define FTZ 1
#define DAZ 1   

void enableFtzDaz()
{
    int mxcsr = _mm_getcsr ();

    if (FTZ) {
            mxcsr |= (1<<15) | (1<<11);
    }

    if (DAZ) {
            mxcsr |= (1<<6);
    }

    _mm_setcsr (mxcsr);
}

Also use the gcc switches: -msse -mfpmath=sse

(corresponding credits to Carl Hetherington [1])

[1] http://carlh.net/plugins/denormals.php
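If flushing should only apply to one hot section rather than the whole thread, one option is to save and restore MXCSR around it. The class below is a hypothetical sketch along those lines, not taken from any of the answers; it uses only _mm_getcsr and _mm_setcsr from <xmmintrin.h> and the same FTZ (bit 15) and DAZ (bit 6) bits as above. scaleFlushed is an invented example of using it.

```cpp
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr

// Hypothetical RAII guard: turns on FTZ and DAZ for the lifetime of the
// object and restores the previous MXCSR value when it goes out of scope.
class ScopedFlushDenormals {
public:
    ScopedFlushDenormals() : saved_(_mm_getcsr()) {
        _mm_setcsr(saved_ | (1u << 15) | (1u << 6));
    }
    ~ScopedFlushDenormals() { _mm_setcsr(saved_); }

    ScopedFlushDenormals(const ScopedFlushDenormals&) = delete;
    ScopedFlushDenormals& operator=(const ScopedFlushDenormals&) = delete;

private:
    unsigned int saved_;
};

// Invented example: multiply two floats with denormals flushed to zero.
// volatile keeps the multiplication between the two MXCSR writes.
float scaleFlushed(float value, float factor) {
    ScopedFlushDenormals guard;
    volatile float v = value;
    volatile float f = factor;
    volatile float product = v * f;
    return product;
}
```

MXCSR is per-thread state, so a guard like this only affects the thread that creates it.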

Dan Neely's comment ought to be expanded into an answer:

It is not the zero constant 0.0f that is denormalized or causes a slowdown; it is the values that approach zero with each iteration of the loop. As they come closer and closer to zero, they need more precision to represent and they become denormalized. These are the y[i] values. (They approach zero because x[i]/z[i] is less than 1.0 for all i.)

The crucial difference between the slow and fast versions of the code is the statement y[i] = y[i] + 0.1f;. As soon as this line is executed in each iteration of the loop, the extra precision in the float is lost, and the denormalization needed to represent that precision is no longer needed. Afterwards, floating-point operations on y[i] remain fast because they aren't denormalized.

Why is the extra precision lost when you add 0.1f? Because floating-point numbers only have so many significant digits. Say you have enough storage for three significant digits, then 0.00001 = 1e-5, and 0.00001 + 0.1 = 0.1, at least for this example float format, because it doesn't have room to store the least significant digit in 0.10001.

In short, y[i]=y[i]+0.1f; y[i]=y[i]-0.1f; isn't the no-op you might think it is.
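The rounding can be reproduced in isolation. This is a small sketch rather than the original loop: addThenSubtract is a made-up helper that mirrors the two marked lines, and the volatile temporaries force each intermediate result to be rounded to float.

```cpp
// Hypothetical helper mirroring "y = y + c; y = y - c;" from the inner loop.
float addThenSubtract(float y, float c) {
    volatile float sum = y + c;     // y is rounded to the spacing of floats near c
    volatile float diff = sum - c;  // what is left of y after that rounding
    return diff;
}
```

Near 0.1f, adjacent floats are 2^-27 (about 7.45e-9) apart, so addThenSubtract(y, 0.1f) can only return a multiple of 2^-27. A denormal such as 1e-40f comes back as exactly 0, whereas addThenSubtract(1e-40f, 0.0f) returns the denormal unchanged. This matches the printed output above: the values from the 0.1f run are all multiples of 2^-27, for example 7.45058e-008 = 10 * 2^-27.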

Mysticial said this as well: the content of the floats matters, not just the assembly code.

EDIT: To put a finer point on this, not every floating-point operation takes the same amount of time to run, even if the machine opcode is the same. For some operands/inputs, the same instruction will take more time to run. This is especially true for denormal numbers.

For a long time now, CPUs have been only a bit slower for denormal numbers. My Zen2 CPU needs five clock cycles for a computation with denormal inputs and denormal outputs and four clock cycles with a normalized number.

This is a small benchmark written with Visual C++ to show the slightly performance-degrading effect of denormal numbers:

#include <iostream>
#include <cstdint>
#include <chrono>
#include <cmath>   // trunc

using namespace std;
using namespace chrono;

uint64_t denScale( uint64_t rounds, bool den );

int main()
{
    auto bench = []( bool den ) -> double
    {
        constexpr uint64_t ROUNDS = 25'000'000;
        auto start = high_resolution_clock::now();
        int64_t nScale = denScale( ROUNDS, den );
        return (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / nScale;
    };
    double
        tDen = bench( true ),
        tNorm = bench( false ),
        rel = tDen / tNorm - 1;
    cout << tDen << endl;
    cout << tNorm << endl;
    cout << trunc( 100 * 10 * rel + 0.5 ) / 10 << "%" << endl;
}

This is the MASM assembly part.

PUBLIC ?denScale@@YA_K_K_N@Z

CONST SEGMENT
DEN DQ 00008000000000000h
ONE DQ 03FF0000000000000h
P5  DQ 03fe0000000000000h
CONST ENDS

_TEXT SEGMENT
?denScale@@YA_K_K_N@Z PROC
    xor     rax, rax
    test    rcx, rcx
    jz      byeBye
    mov     r8, ONE
    mov     r9, DEN
    test    dl, dl
    cmovnz  r8, r9
    movq    xmm1, P5
    mov     rax, rcx
loopThis:
    movq    xmm0, r8
REPT 52
    mulsd   xmm0, xmm1
ENDM
    sub     rcx, 1
    jae     loopThis
    mov     rdx, 52
    mul     rdx
byeBye:
    ret
?denScale@@YA_K_K_N@Z ENDP
_TEXT ENDS
END

It would be nice to see some results in the comments.
