Double vs Float vs _Float16 (Running Time)
I have a simple question in C. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but the running time is not noticeably faster than the single- or double-precision versions. I tested half, single, and double precision with a very simple program that just adds numbers in a loop. The half-precision version is slower than single or double, and single is about the same speed as double.
#include <stdio.h>
#include <time.h>

typedef double FP;
// double   - double precision
// float    - single precision
// _Float16 - half precision

int main(int argc, const char * argv[]) {
    float time;
    clock_t start1, end1;

    start1 = clock();
    FP temp = 0;
    for (int i = 0; i < 100; i++) {
        temp = temp + i;
    }
    end1 = clock();

    time = (double)(end1 - start1) / CLOCKS_PER_SEC;
    printf("[] %.16f\n", time);
    return 0;
}
I expected half precision to be much faster than single or double precision. How can I verify that half precision is faster than single, and that float is faster than double?
Please help me.
Here is an eminently surprising fact about floating point: single-precision (float) arithmetic is not necessarily faster than double precision.
How can this be? Floating-point arithmetic is hard, so doing it with twice the precision is at least twice as hard and must take longer, right?
Well, no. Yes, it's more work to compute with higher precision, but as long as the work is being done by dedicated hardware (by some kind of floating-point unit, or FPU), everything is probably happening in parallel. Double precision may be twice as hard, and there may therefore be twice as many transistors devoted to it, but it doesn't take any longer.
In fact, if you're on a system with an FPU that supports both single- and double-precision floating point, a good rule is: always use double. The reason for this rule is that type float is often inadequately accurate. So if you always use double, you'll quite often avoid numerical inaccuracies (that would kill you, if you used float), but it won't be any slower.
Now, everything I've said so far assumes that your FPU does support the types you care about, in hardware. If there's a floating-point type that's not supported in hardware, if it has to be emulated in software, it's obviously going to be slower, often much slower. There are at least three areas where this effect manifests:
- Small or embedded processors with no FPU at all, where all floating point must be emulated in software. (The narrower type float may be advantageous there.)
- Extended or nonstandard types, such as long double on platforms whose FPU does not implement it.
- Newer, narrower types such as _Float16: if the hardware doesn't support it and the compiler has to emulate it in software, it's no surprise that it's much slower than float or double.

I've extracted out the relevant part of your code into C++ so it can be easily instantiated for each type:
template<typename T>
T calc() {
    T sum = 0;
    for (int i = 0; i < 100; i++) {
        sum += i;
    }
    return sum;
}
Compiling this in Clang with optimisations (-O3) and looking at the assembly listing on Godbolt suggests that:
- the double version has the fewest instructions (4) in the inner loop
- the float version has 5 instructions in the inner loop, and looks basically comparable to the double version
- the _Float16 version has 9 instructions in the inner loop, hence is likely the slowest; the extra instructions are fcvt, which convert between the float16 and float32 formats
Note that counting instructions is only a rough guide to performance! E.g. some instructions take multiple cycles to execute, and pipelined execution means that multiple instructions can be executed in parallel.
Clang's language extension docs suggest that _Float16 is supported on ARMv8.2a, and the M1 appears to be v8.4, so presumably it also supports this. I'm not sure how to enable this in Godbolt though, sorry!
I'd use clock_gettime(CLOCK_MONOTONIC) for high-precision (i.e. nanosecond) timing under Linux. OSX doesn't appear to make this available, but alternatives seem to exist; see "Monotonic clock on OSX".