
What is faster on division? doubles / floats / UInt32 / UInt64? in C++/C

I did some speed testing to figure out what is fastest when doing multiplication or division on numbers. I had to really work hard to defeat the optimiser. I got nonsensical results such as a massive loop operating in 2 microseconds, or that multiplication was the same speed as division (if only that were true).

After I finally worked hard enough to defeat enough of the compiler optimisations, while still letting it optimise for speed, I got these speed results. They may be of interest to someone else?

If my test is STILL FLAWED, let me know, but be kind seeing as I just spent two hours writing this crap :P

64 time: 3826718 us
32 time: 2476484 us
D(mul) time: 936524 us
D(div) time: 3614857 us
S time: 1506020 us

"Multiplying to divide" using doubles seems the fastest way to do a division, followed by integer division. 使用双打“乘以除法”似乎是进行除法的最快方法,其次是整数除法。 I did not test the accuracy of division. 我没有测试分裂的准确性。 Could it be that "proper division" is more accurate? 可能是“正确的划分”更准确吗? I have no desire to find out after these speed test results as I'll just be using integer division on a base 10 constant and letting my compiler optimise it for me ;) (and not defeating it's optimisations either). 我不想在这些速度测试结果之后发现,因为我只是在基数为10的常数上使用整数除法,让我的编译器为我优化它;)(并且不会破坏它的优化)。

Here's the code I used to get the results:

#include <cstdio>
#include <cstdint>
#include <ctime>

int Run(int bla, int div, int add, int minus) {
    // these parameters are to force the compiler to not be able to optimise away the
    // multiplications and divides :)
    long LoopMax = 100000000;

    uint32_t Origbla32 = 1000000000;
    long i = 0;

    uint32_t bla32 = Origbla32;
    uint32_t div32 = div;
    clock_t Time32 = clock();
    for (i = 0; i < LoopMax; i++) {
        div32 += add;
        div32 -= minus;
        bla32 = bla32 / div32;
        bla32 += bla;
        bla32 = bla32 * div32;
    }
    Time32 = clock() - Time32;

    uint64_t bla64 = bla32;
    clock_t Time64 = clock();
    uint64_t div64 = div;
    for (long i = 0; i < LoopMax; i++) {
        div64 += add;
        div64 -= minus;
        bla64 = bla64 / div64;
        bla64 += bla;
        bla64 = bla64 * div64;
    }
    Time64 = clock() - Time64;

    double blaDMul = Origbla32;
    double multodiv = 1.0 / (double)div;
    double multomul = div;
    clock_t TimeDMul = clock();
    for (i = 0; i < LoopMax; i++) {
        multodiv += add;
        multomul -= minus;
        blaDMul = blaDMul * multodiv;
        blaDMul += bla;
        blaDMul = blaDMul * multomul;
    }
    TimeDMul = clock() - TimeDMul;

    double blaDDiv = Origbla32;
    clock_t TimeDDiv = clock();
    for (i = 0; i < LoopMax; i++) {
        multodiv += add;
        multomul -= minus;
        blaDDiv = blaDDiv / multomul;
        blaDDiv += bla;
        blaDDiv = blaDDiv / multodiv;
    }
    TimeDDiv = clock() - TimeDDiv;

    float blaS = Origbla32;
    float divS = div;
    clock_t TimeS = clock();
    for (i = 0; i < LoopMax; i++) {
        divS += add;
        divS -= minus;
        blaS = blaS / divS;
        blaS += bla;
        blaS = blaS * divS;
    }
    TimeS = clock() - TimeS;

    printf("64 time: %i us  (%i)\n", (int)Time64, (int)bla64);
    printf("32 time: %i us  (%i)\n", (int)Time32, bla32);

    printf("D(mul) time: %i us  (%f)\n", (int)TimeDMul, blaDMul);
    printf("D(div) time: %i us  (%f)\n", (int)TimeDDiv, blaDDiv);
    printf("S time: %i us  (%f)\n", (int)TimeS, blaS);

    return 0;
}

int main(int argc, char* const argv[]) {
    Run(0, 10, 0, 0); // adds and minuses 0 so it doesn't affect the math, only kills the opts
    return 0;
}

There are lots of ways to perform certain arithmetic, so there might not be a single answer (shifting, fractional multiplication, actual division, some round-trip through a logarithm unit, etc; these might all have different relative costs depending on the operands and resource allocation).

Let the compiler do its thing with the program and data flow information it has.

For some data applicable to assembly on x86, you might look at: "Instruction latencies and throughput for AMD and Intel x86 processors".

What is fastest will depend entirely on the target architecture. It looks here like you're interested only in the platform you happen to be on, which, guessing from your execution times, seems to be 64-bit x86, either Intel (Core2?) or AMD.

That said, floating-point multiplication by the inverse will be the fastest on many platforms, but is, as you speculate, usually less accurate than a floating-point divide (two roundings instead of one -- whether or not that matters for your usage is a separate question). In general, you are better off re-arranging your algorithm to use fewer divides than you are jumping through hoops to make division as efficient as possible (the fastest division is the one you don't do), and make sure to benchmark before you spend time optimizing at all, as algorithms that bottleneck on division are few and far between.
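To make the "fewer divides" point concrete, here is a minimal sketch (not from the original post; the function names are made up) of hoisting a loop-invariant divisor out of a loop by precomputing its reciprocal:

#include <cstddef>

// One divide per element: the divisor is applied to every value.
void scale_by_division(double* data, std::size_t n, double divisor) {
    for (std::size_t i = 0; i < n; i++)
        data[i] = data[i] / divisor;
}

// One divide total: the reciprocal is computed once and the loop only multiplies.
// Note this introduces a second rounding, as discussed above.
void scale_by_reciprocal(double* data, std::size_t n, double divisor) {
    const double inv = 1.0 / divisor;
    for (std::size_t i = 0; i < n; i++)
        data[i] = data[i] * inv;
}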

Also, if you have integer sources and need an integer result, make sure to include the cost of conversion between integer and floating-point in your benchmarking.

Since you're interested in timings on a specific machine, you should be aware that Intel now publishes this information in their Optimization Reference Manual (pdf). Specifically, you will be interested in the tables of Appendix C section 3.1, "Latency and Throughput with Register Operands".

Be aware that integer divide timings depend strongly on the actual values involved. Based on the information in that guide, it seems that your timing routines still have a fair bit of overhead, as the performance ratios you measure don't match up with Intel's published information.

As Stephen mentioned, use the optimisation manual - but you should also be considering the use of SSE instructions. These can do 4 or 8 divisions / multiplications in a single instruction.
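As a rough illustration (a sketch, not from the original answer; doing 8 at a time would need AVX and _mm256_div_ps instead), SSE intrinsics let one instruction divide four packed floats:

#include <xmmintrin.h>

// Divide four floats by the same divisor with a single divps instruction.
void div4(const float* in, float divisor, float* out) {
    __m128 v = _mm_loadu_ps(in);            // load 4 floats (unaligned)
    __m128 d = _mm_set1_ps(divisor);        // broadcast the divisor into all 4 lanes
    _mm_storeu_ps(out, _mm_div_ps(v, d));   // 4 divisions at once, store 4 results
}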

Also, it is fairly common for a division to take a single clock cycle to issue. The result may not be available for several clock cycles (called latency), however the next division can begin during this time (overlapping with the first) as long as it does not require the result from the first. This is due to pipelining in the CPU, in the same way as you can wash more clothes while the previous load is still drying.
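A small sketch of the dependency effect described above (illustrative only, and not from the original answer; whether divides actually overlap depends on the CPU's divider):

// Dependent divides: each iteration needs the previous result, so every
// divide pays its full latency before the next one can start.
double divide_chain(double x, double d, int n) {
    for (int i = 0; i < n; i++)
        x = x / d;
    return x;
}

// Independent divides: the four streams do not depend on each other, so the
// hardware is free to overlap them and throughput, not latency, dominates.
void divide_independent(double x[4], double d, int n) {
    for (int i = 0; i < n; i++) {
        x[0] = x[0] / d;
        x[1] = x[1] / d;
        x[2] = x[2] / d;
        x[3] = x[3] / d;
    }
}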

Multiplying to divide is a common trick, and should be used wherever your divisor changes infrequently.

There is a very good chance that you will spend time and effort making the maths fast only to discover that it is the speed of memory access (as you navigate the input and write the output) that limits your final implementation.

I wrote a flawed test to do this on MSVC 2008:

double i32Time  = GetTime();
{
    volatile __int32 i = 4;
    __int32 count   = 0;
    __int32 max     = 1000000;
    while( count < max )
    {
        i /= 61;
        count++;
    }
}
i32Time = GetTime() - i32Time;

double i64Time  = GetTime();
{
    volatile __int64 i = 4;
    __int32 count   = 0;
    __int32 max     = 1000000;
    while( count < max )
    {
        i /= 61;
        count++;
    }
}
i64Time = GetTime() - i64Time;


double fTime    = GetTime();
{
    volatile float i = 4;
    __int32 count   = 0;
    __int32 max     = 1000000;
    while( count < max )
    {
        i /= 4.0f;
        count++;
    }
}
fTime   = GetTime() - fTime;

double fmTime   = GetTime();
{
    volatile float i = 4;
    const float div = 1.0f / 4.0f;
    __int32 count   = 0;
    __int32 max     = 1000000;
    while( count < max )
    {
        i *= div;
        count++;
    }
}
fmTime  = GetTime() - fmTime;

double dTime    = GetTime();
{
    volatile double i = 4;
    __int32 count   = 0;
    __int32 max     = 1000000;
    while( count < max )
    {
        i /= 4.0f;
        count++;
    }
}
dTime   = GetTime() - dTime;

double dmTime   = GetTime();
{
    volatile double i = 4;
    const double div = 1.0f / 4.0f;
    __int32 count   = 0;
    __int32 max     = 1000000;
    while( count < max )
    {
        i *= div;
        count++;
    }
}
dmTime  = GetTime() - dmTime;


DebugOutput( _T( "%f\n" ), i32Time );
DebugOutput( _T( "%f\n" ), i64Time );
DebugOutput( _T( "%f\n" ), fTime );
DebugOutput( _T( "%f\n" ), fmTime );
DebugOutput( _T( "%f\n" ), dTime );
DebugOutput( _T( "%f\n" ), dmTime );

DebugBreak();

I then ran it on an AMD64 Turion 64 in 32-bit mode. The results I got were as follows:

0.006622
0.054654
0.006283
0.006353
0.006203
0.006161

The reason the test is flawed is the usage of volatile, which forces the compiler to re-load the variable from memory just in case it has changed. All in, it shows there is precious little difference between any of the implementations on this machine (__int64 is obviously slow).

It also categorically shows that the MSVC compiler performs the multiply-by-reciprocal optimisation. I imagine GCC does the same if not better. If I change the float and double division checks to divide by "i" then the time increases significantly. Though, while a lot of that could be the re-loading from memory, it is obvious the compiler can't optimise that away so easily.

To understand such micro-optimisations, try reading this PDF.

All in, I'd argue that if you are worrying about such things you obviously haven't profiled your code. Profile and fix the problems as and when they actually ARE a problem.

Agner Fog has done some pretty detailed measurements himself, which can be found here. If you're really trying to optimize stuff, you should read the rest of the documents from his software optimization resources as well.

I would point out that, even if you are measuring non-vectorized floating point operations, the compiler has two options for the generated assembly: it can use the FPU instructions (fadd, fmul) or it can use SSE instructions while still manipulating one floating point value per instruction (addss, mulss). In my experience the SSE instructions are faster and have fewer inaccuracies, but compilers don't make it the default because it could break compatibility with code that relies on the old behavior. You can turn it on in gcc with the -mfpmath=sse flag.
