
Double vs Float vs _Float16 (Running Time)

I have a simple question about C. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but the running time is not noticeably faster than the single- or double-precision versions. I tested half, single, and double precision with very simple code that just adds numbers. Half was slower than single or double, and single was about the same as double.

#include <stdio.h>
#include <time.h>

typedef double FP;
// double   - double precision
// float    - single precision
// _Float16 - half precision
int main(int argc, const char * argv[]) {

    double time;
    clock_t start1, end1;
    start1 = clock();

    int i;
    FP temp = 0;

    for (i = 0; i < 100; i++) {
        temp = temp + i;
    }
    end1 = clock();
    time = (double)(end1 - start1) / CLOCKS_PER_SEC;

    printf("%f\n", (double)temp);   // print temp so the loop isn't optimised away
    printf("[] %.16f\n", time);
    return 0;
}

My expectation was that half precision would be much faster than single or double precision. How can I check that half precision is faster, and that float is faster than double?

Please help me.

Here is a possibly surprising fact about floating point:

Single-precision (float) arithmetic is not necessarily faster than double precision.

How can this be? Floating-point arithmetic is hard, so doing it with twice the precision is at least twice as hard and must take longer, right?

Well, no. Yes, it's more work to compute with higher precision, but as long as the work is being done by dedicated hardware (by some kind of floating point unit, or FPU), everything is probably happening in parallel. Double precision may be twice as hard, and there may therefore be twice as many transistors devoted to it, but it doesn't take any longer.

In fact, if you're on a system with an FPU that supports both single- and double-precision floating point, a good rule is: always use double. The reason for this rule is that type float is often inadequately accurate. So if you always use double, you'll quite often avoid numerical inaccuracies (that would kill you if you used float), but it won't be any slower.
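For instance, here is a minimal sketch of how float's limited precision shows up in a plain running sum (the exact error you see depends on your platform and rounding, so treat the comments as indicative only):

#include <stdio.h>

int main(void) {
    // Add 0.1 ten million times; the mathematically exact result is 1000000.
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < 10000000; i++) {
        fsum += 0.1f;   // float: each add rounds to ~7 significant digits
        dsum += 0.1;    // double: each add rounds to ~16 significant digits
    }
    printf("float  sum = %f\n", fsum);  // drifts visibly away from 1000000
    printf("double sum = %f\n", dsum);  // stays very close to 1000000
    return 0;
}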

Now, everything I've said so far assumes that your FPU does support the types you care about, in hardware. If there's a floating-point type that's not supported in hardware, if it has to be emulated in software, it's obviously going to be slower, often much slower. There are at least three areas where this effect manifests:

  • If you're using a microcontroller with no FPU at all, it's common for all floating point to be implemented in software, and to be painfully slow. (I think it's also common for double precision to be even slower, meaning that float may be advantageous there.)
  • If you're using a nonstandard or less-than-standard type that, for that reason, is implemented in software, it's obviously going to be slower. In particular, the FPUs I'm familiar with don't support a half-precision (16-bit) floating-point type, so it wouldn't be surprising if it were significantly slower than regular float or double. (One way to check what your own toolchain advertises is sketched after this list.)
  • Some GPUs have good support for single or half precision, but poor or no support for double.
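To check whether your toolchain thinks half-precision arithmetic is available in hardware, one option is to test the ARM C Language Extensions feature macro for scalar FP16 arithmetic. This is only a sketch: __ARM_FEATURE_FP16_SCALAR_ARITHMETIC is the macro I believe ACLE defines, and on non-Apple AArch64 toolchains you would typically need to compile with something like -march=armv8.2-a+fp16 for it to be set, so verify both details against your compiler's documentation.

#include <stdio.h>

int main(void) {
#if defined(__ARM_FEATURE_FP16_SCALAR_ARITHMETIC)
    // ACLE feature macro: scalar FP16 arithmetic instructions are available,
    // so _Float16 adds shouldn't need fcvt round trips through float.
    printf("scalar FP16 arithmetic in hardware: yes\n");
#else
    printf("scalar FP16 arithmetic in hardware: not advertised; "
           "_Float16 maths may go via float\n");
#endif
    return 0;
}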

I've extracted the relevant part of your code into a C++ template so it can easily be instantiated for each type:

template<typename T>
T calc() {
    T sum = 0;
    for (int i = 0; i < 100; i++) {
        sum += i;
    }
    return sum;
}

Compiling this with Clang with optimisations (-O3) and looking at the assembly listing on Godbolt suggests that:

  • the double version has the fewest instructions (4) in its inner loop
  • the float version has 5 instructions in the inner loop, and looks basically comparable to the double version
  • the _Float16 version has 9 instructions in the inner loop, and hence is likely the slowest; the extra instructions are fcvt conversions between the float16 and float32 formats.

Note that counting instructions is only a rough guide to performance! For example, some instructions take multiple cycles to execute, and pipelined execution means that several instructions can be in flight at once.

Clang's language extension docs suggest that _Float16 is supported on ARMv8.2-A, and the M1 appears to implement v8.4, so presumably it supports this too. I'm not sure how to enable it on Godbolt, though, sorry!

I'd use clock_gettime(CLOCK_MONOTONIC) for high-precision (i.e. nanosecond) timing under Linux. OSX doesn't appear to make this available, but alternatives seem to exist; see "Monotonic clock on OSX".
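As an example, here is a minimal timing sketch along those lines. It assumes a platform that provides clock_gettime(CLOCK_MONOTONIC) and a compiler that accepts _Float16; the volatile sink and the large iteration count are only there so the loop isn't optimised away, and the numbers it prints are just a rough comparison, not a rigorous benchmark.

#define _POSIX_C_SOURCE 199309L   // may be needed for clock_gettime with strict -std=c11
#include <stdio.h>
#include <time.h>

#define N 100000000L   // enough iterations that the loop dominates the timing

static double seconds_between(struct timespec t0, struct timespec t1) {
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

// Time N dependent additions in the given floating-point type.
#define TIME_TYPE(T, name)                                          \
    do {                                                            \
        struct timespec t0, t1;                                     \
        volatile T sum = 0;       /* volatile: keep every add */    \
        clock_gettime(CLOCK_MONOTONIC, &t0);                        \
        for (long i = 0; i < N; i++) sum = sum + (T)1;              \
        clock_gettime(CLOCK_MONOTONIC, &t1);                        \
        printf("%-9s %.6f s\n", name, seconds_between(t0, t1));     \
    } while (0)

int main(void) {
    TIME_TYPE(double,   "double");
    TIME_TYPE(float,    "float");
    TIME_TYPE(_Float16, "_Float16");
    return 0;
}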
