Floating-point-to-integer conversion rounding up instead of truncating

Question

I was surprised to find that a floating-point-to-integer conversion rounded up instead of truncating the fractional part. Here is some sample code, compiled using Clang, that reproduces that behavior:

double a = 1.12;  // 1.1200000000000001 * 2^0
double b = 1024LL * 1024 * 1024 * 1024 * 1024;  // 1 * 2^50
double c = a * b;  // 1.1200000000000001 * 2^50
long long d = c;  // 1261007895663739

Using exact math, the floating-point value represents

1.1200000000000001 * 2^50 = 1261007895663738.9925899906842624

I expected the resulting integer to be 1261007895663738 due to truncation, but it is actually 1261007895663739. Why?

Answer 1

Assuming IEEE 754 double precision, 1.12 is exactly

1.12000000000000010658141036401502788066864013671875

Written in binary, its significand is exactly:

1.0001111010111000010100011110101110000101000111101100

Note that the two trailing zeros are intentional: they are part of the double-precision significand (1 bit before the binary point plus 52 fractional bits).

So if you shift the binary point 50 places to the right (that is, multiply by 2^50), you get an integer value:

100011110101110000101000111101011100001010001111011.00

or in decimal

1261007895663739

The product is already a whole number, so when it is converted to long long no truncation or rounding occurs; the conversion is exact.

Answer 2

Using exact math, the floating-point value represents...

a is not exactly 1.12, as 0.12 is not a dyadic rational (it cannot be written as an integer divided by a power of two).

// `a` not exactly 1.12 
double a = 1.12;  // 1.1200000000000001 * 2^0

Nearby double values:

1.11999999999999988...  Second-closest double
1.12                    Decimal value written in the code
1.12000000000000011...  Closest double (the value stored in a)
1.12000000000000033...

Instead, let us print the values with enough precision to see what is really stored.

#include <stdio.h>
#include <float.h>
#include <math.h>  // for modf()

int main() {
  double a = 1.12;  // 1.1200000000000001 * 2^0
  double b = 1024LL * 1024 * 1024 * 1024 * 1024;  // 1 * 2^50
  int prec = DBL_DECIMAL_DIG;
  printf("a %.*e\n", prec, a);
  printf("b %.*e\n", prec, b);

  double c = a * b;
  double whole;
  printf("c %.*e (r:%g)\n", prec, c, modf(c, &whole));
  long long d = (long long) c;
  printf("d %lld\n", d);
}

Output

a 1.12000000000000011e+00
b 1.12589990684262400e+15
c 1.26100789566373900e+15 (r:0)
d 1261007895663739


Answer 1: score 3, accepted, 2021-03-06 23:11:46

Answer 2: score 1, 2021-03-07 17:21:40
