This is a very simple question, but an important one since it affects my whole project tremendously.
Suppose I have the following code snippet:
unsigned int x = 0xffffffff;
float f = (float)((double)x * (double)2.328306436538696e-010); // x/2^32
I would expect f to be something like 0.99999, but instead it rounds up to 1, since that's the closest float approximation. That's not good, since I need float values on the interval [0,1), not [0,1]. I'm sure it's something simple, but I'd appreciate some help.
In C (since C99), you can change the rounding direction with fesetround (declared in <fenv.h>, provided by libm):
#include <stdio.h>
#include <fenv.h>

int main(void)
{
    #pragma STDC FENV_ACCESS ON
    fesetround(FE_DOWNWARD);
    // volatile -- uncomment for GNU gcc and whoever else doesn't support FENV
    unsigned long x = 0xffffffff;
    float f = (float)((double)x * (double)2.328306436538696e-010); // x/2^32
    printf("%.50f\n", f);
    return 0;
}
Tested with IBM XL, Sun Studio, clang, and GNU gcc. This gives me 0.99999994039535522460937500000000000000000000000000 in all cases.
The value above which a double rounds to 1 or more when converted to float in the default IEEE 754 rounding mode is 0x1.ffffffp-1 (in C99's hexadecimal notation, since your question is tagged “C”).
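To make that threshold concrete, here is a minimal sketch. The helper name rounds_to_one is made up for illustration, and the sketch assumes the default round-to-nearest mode is in effect:

```c
#include <stdbool.h>

/* Hypothetical helper (not from the original answer): reports whether a
   double becomes 1.0f or more when converted to float in the current
   rounding mode. */
static bool rounds_to_one(double d)
{
    /* volatile: make sure the value really is rounded to float precision */
    volatile float f = (float)d;
    return f >= 1.0f;
}
```

Under round-to-nearest, rounds_to_one(0x1.ffffffp-1) should be true, because that value sits exactly halfway between 0x1.fffffep-1 and 1.0 and the tie goes to the even significand. The next double below it, 0x1.fffffefffffffp-1, should give false.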
Your options are:
1. switch the rounding mode to round-downward before the conversion, or
2. multiply by (0x1.ffffffp-1 / 0xffffffffp0) (give or take one ULP) to exploit the full single-precision range [0, 1) without getting the value 1.0f.
Method 2 leads to the constant 0x1.ffffff01fffffp-33:
double factor = nextafter(0x1.ffffffp-1 / 0xffffffffp0, 0.0);
unsigned int x = 0xffffffff;
float f = (float)((double)x * factor);
printf("factor:%a\nunrounded:%a\nresult:%a\n", factor, (double)x * factor, f);
Prints:
factor:0x1.ffffff01fffffp-33
unrounded:0x1.fffffefffffffp-1
result:0x1.fffffep-1
There's not much you can do - your int holds 32 bits but the mantissa of a float holds only 24. Rounding is going to happen. You could change the processor rounding mode to round down instead of to nearest, but that is going to cause some side effects that you want to avoid, especially if you don't restore the rounding mode when you are finished.
There's nothing wrong with the formula you're using; it's producing the most accurate answer possible for the given input. There's just an end case that's failing a hard requirement. There's nothing wrong with testing for the specific end case and replacing it with the closest value that meets the requirement:
if (f >= 1.0f)
f = 0.99999994f;
0.999999940395355224609375 is the closest value that an IEEE-754 float can take without being equal to 1.0.
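Wrapped up as a function, the end-case test might look like the sketch below. The name to_unit_float_clamped is made up, and the hexadecimal literal 0x1.fffffep-1f is just another spelling of 0.99999994f:

```c
#include <stdint.h>

/* Hypothetical helper: same formula as in the question, with the one
   failing end case replaced by the largest float below 1.0f. */
static float to_unit_float_clamped(uint32_t x)
{
    float f = (float)((double)x * 2.328306436538696e-10);  /* roughly x / 2^32 */

    if (f >= 1.0f)
        f = 0x1.fffffep-1f;  /* 0.999999940395355224609375 */

    return f;
}
```

Every other input keeps its nearest-float result; only the inputs that would have rounded up to 1.0f are changed.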
You could just truncate the value to the maximum precision (keeping the 24 high bits) and divide by 2^24 to get the closest value a float can represent without being rounded to 1:
unsigned int i = 0xffffffff;
float value = (float)(i>>8)/(1<<24);
printf("%.20f\n", value);
printf("%a\n", value);
>>> 0.99999994039535522461
>>> 0x1.fffffep-1
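The same idea as a reusable function, as a sketch only. The name to_unit_float_trunc is hypothetical, and uint32_t is used here on the assumption that the input is exactly 32 bits wide:

```c
#include <stdint.h>

/* Hypothetical helper: keeps the 24 high bits of x and scales by 2^-24.
   Both the int-to-float conversion and the division are exact, so no
   rounding takes place and the result is always k / 2^24 with k < 2^24. */
static float to_unit_float_trunc(uint32_t x)
{
    return (float)(x >> 8) / 16777216.0f;  /* 16777216 = 2^24 */
}
```

The trade-off is that the low 8 bits of the input are simply discarded, so inputs 0 through 255 all map to 0.0f, and the largest input gives 0x1.fffffep-1.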
My eventual solution was to just shrink the size of my constant multiplier. It was probably the best solution since there was no point in multiplying by a double anyway. The precision was not seen after conversion to a float.
So 2.328306436538696e-010 was changed to 2.3283063e-010.
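For anyone who wants to check this, here is a small sketch of the shrunken multiplier. It assumes the shortened constant keeps the same e-010 exponent as the original one and that the product is still computed in double before the cast; the function name to_unit_float_shrunk is made up:

```c
#include <stdint.h>

/* Hypothetical helper: multiplies by a constant slightly smaller than
   2^-32, so even the largest input lands below the point where the
   conversion to float would round up to 1.0f. */
static float to_unit_float_shrunk(uint32_t x)
{
    return (float)((double)x * 2.3283063e-10);
}
```

Since the mapping is monotonic, checking x = 0xffffffff is enough: the product is about 1 - 5.9e-8, which is below the rounding threshold of 0x1.ffffffp-1 and converts to 0x1.fffffep-1.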