Truncating a double to a float in C

Question

This is a very simple question, but an important one, since it affects my whole project tremendously.

Suppose I have the following code snippet:

unsigned int x = 0xffffffff;
float f = (float)((double)x * (double)2.328306436538696e-010); //  x/2^32

I would expect f to be something like 0.99999, but instead it rounds up to 1, since that is the closest float approximation. That's no good, since I need float values in the interval [0,1), not [0,1]. I'm sure it's something simple, but I'd appreciate some help.

Answer 1

In C (since C99), you can change the rounding direction with fesetround, declared in <fenv.h> and provided by libm:

#include <stdio.h>
#include <fenv.h>
int main()
{
    #pragma STDC FENV_ACCESS ON
    fesetround(FE_DOWNWARD);
    // Declare x as volatile for GNU gcc and any other compiler that doesn't support FENV_ACCESS
    unsigned long x = 0xffffffff;
    float f = (float)((double)x * (double)2.328306436538696e-010); //  x/2^32
    printf("%.50f\n", f);
}

Tested with IBM XL, Sun Studio, clang, and GNU gcc. This gives me 0.99999994039535522460937500000000000000000000000000 in all cases.

Answer 2

The threshold at or above which a double rounds to 1 or more when converted to float in the default IEEE 754 rounding mode is 0x1.ffffffp-1 (in C99's hexadecimal notation, since your question is tagged “C”).

Your options are:

1. turn the FPU rounding mode to round-downward before the conversion, or
2. multiply by (0x1.ffffffp-1 / 0xffffffffp0) (give or take one ULP) to exploit the full single-precision range [0, 1) without getting the value 1.0f.

Method 2 leads to the constant 0x1.ffffff01fffffp-33:

double factor = nextafter(0x1.ffffffp-1 / 0xffffffffp0, 0.0);
unsigned int x = 0xffffffff;
float f = (float)((double)x * factor);
printf("factor:%a\nunrounded:%a\nresult:%a\n", factor, (double)x * factor, f);

Prints:

factor:0x1.ffffff01fffffp-33
unrounded:0x1.fffffefffffffp-1
result:0x1.fffffep-1

Answer 3

There's not much you can do - your int holds 32 bits but the mantissa of a float holds only 24. Rounding is going to happen. You could change the processor rounding mode to round down instead of to nearest, but that is going to cause some side effects that you want to avoid especially if you don't restore the rounding mode when you are finished.

There's nothing wrong with the formula you're using, it's producing the most accurate answer possible for the given input. There's just an end case that's failing a hard requirement. There's nothing wrong with testing for the specific end case and replacing it with the closest value that meets the requirement:

if (f >= 1.0f)
    f = 0.99999994f;

0.999999940395355224609375 is the closest value that an IEEE-754 float can take without being equal to 1.0.

Answer 4

You could just truncate the value to the maximum precision (keeping the 24 high bits) and divide by 2^24 to get the closest value a float can represent without being rounded to 1:

unsigned int i = 0xffffffff;
float value = (float)(i>>8)/(1<<24);

printf("%.20f\n", value);
printf("%a\n", value);

>>> 0.99999994039535522461
>>> 0x1.fffffep-1

Answer 5

My eventual solution was to just shrink my constant multiplier. It was probably the best solution, since there was no point in multiplying by a double anyway: the extra precision is lost after conversion to a float.

So 2.328306436538696e-010 was changed to 2.3283063e-010.

5 answers

solution1
8 2013-08-06 16:35:27

solution2
3 2013-08-06 16:33:21

solution3
1 2013-08-06 16:35:08

solution4
1 2013-08-06 16:50:04

solution5
0 ACCEPTED 2013-08-15 16:28:50
