float a = 1.0 + ((float) (1 << 25));
float b = 1.0 + ((float) (1 << 26));
float c = 1.0 + ((float) (1 << 27));
What are the float values of a, b, and c after running this code? Explain why the bit layout of a, b, and c causes each value to be what it is.
What are the float values of a, b, and c after running this code?
When int is 32 bits wide, the integer shifts below are well defined and exact. The code is not shifting a float, @EOF.
// OK with 32-bit int
1 << 25
1 << 26
1 << 27
Casting the above power-of-2 values to float is also well defined, with no precision loss.
// OK and exact
(float) (1 << 25)
(float) (1 << 26)
(float) (1 << 27)
Adding those to the double 1.0 gives well-defined, exact sums. A typical double has a 53-bit significand (e.g. DBL_MANT_DIG == 53) and can represent 0x8000001.0p0 exactly.
// Let us use hexadecimal FP notation
1.0 + ((float) (1 << 25)) // 0x2000001.0p0 or 0x1.0000008p+25
1.0 + ((float) (1 << 26)) // 0x4000001.0p0 or 0x1.0000004p+26
1.0 + ((float) (1 << 27)) // 0x8000001.0p0 or 0x1.0000002p+27
Finally, the code assigns these double values to a float. The values are within the range of a typical float encoding, but a float cannot represent them exactly.
A typical float has a 24-bit significand, e.g. FLT_MANT_DIG == 24.
If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2 (the same wording covers conversions between real floating types in § 6.3.1.5 1).
A typical implementation rounds to nearest, ties to even.
float s = 0x0800001.0p0; printf("%a\n", s);
float t = 0x1000001.0p0; printf("%a\n", t); // 0x1000001.0p0 is halfway between two floats
float a = 0x2000001.0p0; printf("%a\n", a);
float b = 0x4000001.0p0; printf("%a\n", b);
float c = 0x8000001.0p0; printf("%a\n", c);
Output
0x1.000002p+23 // exact conversion double to float
0x1p+24
0x1p+25
0x1p+26
0x1p+27
Explain why the bit layout of a, b, and c causes each value to be what it is.
The bit layout is not the issue. It is the 24-bit significand of float (FLT_MANT_DIG == 24), together with the implementation-defined rounding behavior, that causes each double value to be rounded to a nearby float. Any float layout with FLT_MANT_DIG == 24 would give like results.