
Double vs Float vs _Float16 (Running Time)

I have a simple question about C. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but the running time is not noticeably faster than the single- or double-precision versions. I tested half, single, and double precision with a very simple program that just adds numbers. Half precision is slower than single or double, and single is about the same speed as double.

#include <stdio.h>
#include <time.h>

typedef double FP;
// double   - double precision
// float    - single precision
// _Float16 - half precision

int main(int argc, const char * argv[]) {

    double time;
    clock_t start1, end1;
    start1 = clock();

    int i;
    FP temp = 0;

    for (i = 0; i < 100; i++) {
        temp = temp + i;
    }
    end1 = clock();
    time = (double)(end1 - start1) / CLOCKS_PER_SEC;

    // print temp too, so the compiler cannot discard the loop
    printf("[%f] %.16f\n", (double)temp, time);
    return 0;
}

I expected half precision to be much faster than single or double precision. How can I verify that half precision is faster, and that float is faster than double?

Please help me.

Here is an eminently surprising fact about floating point:

Single-precision (float) arithmetic is not necessarily faster than double precision.

How can this be? Floating-point arithmetic is hard, so doing it with twice the precision must be at least twice as hard and take longer, right?

Well, no. Yes, it's more work to compute with higher precision, but as long as the work is being done by dedicated hardware (by some kind of floating-point unit, or FPU), everything is probably happening in parallel. Double precision may be twice as hard, and there may therefore be twice as many transistors devoted to it, but it doesn't take any longer.

In fact, if you're on a system with an FPU that supports both single- and double-precision floating point, a good rule is: always use double. The reason for this rule is that type float is often inadequately accurate. So if you always use double, you'll quite often avoid numerical inaccuracies (that could kill you if you used float), and it won't be any slower.
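
To make "inadequately accurate" concrete, here is a minimal standalone example (plain C, assuming only the usual IEEE-754 binary32/binary64 formats): float has a 24-bit significand, so it cannot even distinguish 2^24 from 2^24 + 1.

#include <stdio.h>

int main(void) {
    // float's 24-bit significand means 2^24 + 1 is not representable:
    // adding 1.0f to 16777216.0f leaves the value unchanged
    float  f = 16777216.0f;   /* 2^24 */
    double d = 16777216.0;

    printf("float : %.1f\n", f + 1.0f);   /* prints 16777216.0 */
    printf("double: %.1f\n", d + 1.0);    /* prints 16777217.0 */
    return 0;
}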

Now, everything I've said so far assumes that your FPU does support the types you care about, in hardware. If there's a floating-point type that's not supported in hardware, so that it has to be emulated in software, it's obviously going to be slower, often much slower. There are at least three areas where this effect manifests:

  • If you're using a microcontroller with no FPU at all, it's common for all floating point to be implemented in software, and to be painfully slow. (I think it's also common for double precision to be even slower, meaning that float may be advantageous there.)
  • If you're using a nonstandard or less-than-standard type that, for that reason, is implemented in software, it's obviously going to be slower. In particular: the FPUs I'm familiar with don't support a half-precision (16-bit) floating-point type, so it wouldn't be surprising if it was significantly slower than regular float or double. (One way to check this on your own machine is sketched just after this list.)
  • Some GPUs have good support for single or half precision, but poor or no support for double.
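
As a quick check for the second point: ARM's C Language Extensions (ACLE) define a feature macro for native scalar half-precision arithmetic. A minimal compile-time sketch (assuming Clang or GCC targeting AArch64; the macro name comes from ACLE, not from your code):

#include <stdio.h>

int main(void) {
#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
    // the target has native fp16 arithmetic (fadd/fmul on h registers),
    // so _Float16 maths need not round-trip through float32
    puts("native scalar FP16 arithmetic: yes");
#else
    // _Float16 may still exist as a storage type, but its arithmetic
    // will go via conversions to and from float32 (or via software)
    puts("native scalar FP16 arithmetic: no");
#endif
    return 0;
}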

I've extracted the relevant part of your code into C++ so it can be easily instantiated for each type:

template<typename T>
T calc() {
    T sum = 0;
    for (int i = 0; i < 100; i++) {
        sum += i;
    }
    return sum;
}

Compiling this in Clang with optimisations (-O3) and looking at the assembly listing on Godbolt suggests that:

  • the double version has the fewest instructions (4) in the inner loop
  • the float version has 5 instructions in the inner loop, and looks basically comparable to the double version
  • the _Float16 version has 9 instructions in the inner loop, and is hence likely the slowest; the extra instructions are fcvt, which convert between the float16 and float32 formats

Note that counting instructions is only a rough guide to performance! E.g. some instructions take multiple cycles to execute, and pipelined execution means that multiple instructions can be executed in parallel.

Clang's language extension docs suggest that _Float16 is supported on ARMv8.2a, and the M1 appears to be v8.4, so presumably it also supports this. I'm not sure how to enable this in Godbolt though, sorry!
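
If you're benchmarking locally instead, here is a minimal sketch along the lines of your original program. Assumptions: -march=armv8.2-a+fp16 is the standard Clang/GCC way to request native fp16 arithmetic on AArch64 (whether your particular toolchain needs it is an assumption), and the volatile accumulator plus the large, arbitrary iteration count are there so the loops aren't folded away:

/* compile with something like: clang -O1 -march=armv8.2-a+fp16 bench.c */
#include <stdio.h>
#include <time.h>

#define N 100000000

/* 'volatile' forces a load and store each iteration, so the compiler
 * cannot reduce the loop to a constant */
#define BENCH(T, label)                                   \
    do {                                                  \
        volatile T sum = 0;                               \
        clock_t start = clock();                          \
        for (int i = 0; i < N; i++)                       \
            sum = sum + (T)1;                             \
        clock_t end = clock();                            \
        printf("%-8s %.3f s\n", label,                    \
               (double)(end - start) / CLOCKS_PER_SEC);   \
    } while (0)

int main(void) {
    BENCH(double,   "double");
    BENCH(float,    "float");
    BENCH(_Float16, "half");
    return 0;
}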

I'd use clock_gettime(CLOCK_MONOTONIC) for high-precision (i.e. nanosecond) timing under Linux. OS X doesn't appear to make this available, but alternatives seem to exist; see Monotonic clock on OSX.
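
For reference, a minimal clock_gettime sketch (POSIX; on macOS you'd substitute one of the alternatives from that link):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... code under test ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    // combine seconds and nanoseconds into a single elapsed-time value
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    printf("elapsed: %.0f ns\n", ns);
    return 0;
}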
