Why addition error near `DBL_MAX` and infinity?

Question

While experimenting with rounding modes, I found that FE_DOWNWARD (and FE_TOWARDZERO) apparently fail to form the expected sum of infinity when adding DBL_MAX and 1 ULP of DBL_MAX; the sum is DBL_MAX instead.

Test code that demonstrates the unexpected sum follows. Under each rounding mode, it adds values near 0.5 ULP and 1.0 ULP to DBL_MAX. No problems were noted with FE_TONEAREST and FE_UPWARD.

Questions:

Do you agree it is an error?
Does the code form the correct answer on another machine?
This sadly follows another problem near DBL_MAX that was recently reported, so perhaps my math library is sub-par. Advice on how to report this is requested.

Compiler notes:
Invoking: Cygwin C Compiler
gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF"rand_i.d" -MT"rand_i.o" -o "rand_i.o" "../rand_i.c"
COLLECT_GCC=gcc
Target: x86_64-pc-cygwin
gcc version 11.3.0 (GCC)
GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP

#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)

int main() {
  const double max = DBL_MAX;
  const double max_ulp = max - nextafter(max, 0);

  int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
  const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
      STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
  int n = sizeof mode / sizeof mode[0];
  int p = (DBL_MANT_DIG + 2)/4;
  int P = (LDBL_MANT_DIG + 2)/4;
  printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);

  printf("LDBL_MAX   :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
  printf("DBL_MAX    :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
  printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
  printf("\n");
  printf("mode:               Addendum                          SUM (double)                 SUM (long double)\n");
  for (int i = 0; i < n; i++) {
    if (fesetround(mode[i])) {
      perror("Invalid mode");
      return -1;
    }
    double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
        nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
        max_ulp, nextafter(max_ulp, INFINITY)};
    const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
        "ulp-", "ulp", "ulp+"};
    int dn = sizeof delta / sizeof delta[0];
    for (int d = 0; d < dn; d++) {
      double sum = max + delta[d];
      printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
          modes[i],  deltas[d], p, delta[d], p, sum, P, 0.0L + max + delta[d]);
    }
    puts("");
  }
}

Output: note 4 unexpected lines.

FLT_EVAL_METHOD:0
LDBL_MAX   :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX    :0x1.fffffffffffffp+1023  1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971   1.9958403095347198e+292

mode:               Addendum                          SUM (double)                 SUM (long double)
mode:FE_DOWNWARD    0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD    0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD    ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023
mode:FE_DOWNWARD    ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--
mode:FE_DOWNWARD    ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--

mode:FE_TONEAREST   0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024
mode:FE_TONEAREST   ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024
mode:FE_TONEAREST   ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000000p+1024

mode:FE_TOWARDZERO  0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO  0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO  ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023
mode:FE_TOWARDZERO  ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--
mode:FE_TOWARDZERO  ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--

mode:FE_UPWARD      0.5 ulp-:0x1.fffffffffffffp+969   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_UPWARD      0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_UPWARD      0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff802p+1023
mode:FE_UPWARD      ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024
mode:FE_UPWARD      ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024
mode:FE_UPWARD      ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000002p+1024

Answer 1

Checking for and dealing with overflow happens in two stages. Note that the actual software/circuitry may use a different strategy, but the result will be as if this were the procedure.

1. The infinite-precision result of the computation is rounded, with rules based on the current rounding mode, but with no limits on the exponent. During this stage there's no such thing as "infinity", but the numbers can get as big as they need to get.
2. If the result is outside the representable range, it is "corrected" to be within the representable range, in a manner based on the rounding mode. For FE_TONEAREST (the "normal" mode) the number is corrected to infinity. For FE_TOWARDZERO, however, it's corrected to +/-DBL_MAX. (For FE_UPWARD and FE_DOWNWARD it depends on the sign of the result: when the mode's direction points away from zero the result is corrected to infinity, and when it points toward zero it is corrected to +/-DBL_MAX.)

The overflow rules for a given mode are thus reminiscent of the rounding rules for that mode, but not really the same. Arguably FE_TONEAREST is the weird one, since it acts like the (IEEE-754-only) ties-away-from-zero rounding mode in all out-of-range situations, rather than noticing that infinity is way farther away than DBL_MAX. But the basic behavior for the other modes is to output +/-DBL_MAX when the mode's preferred direction (given the sign) is toward zero, and infinity when its preferred direction is away from zero.
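The stage 2 rule can be written down as a small decision function. The sketch below only illustrates the model described above; it is not how any FPU or library implements it. The name `overflow_result` and its `negative` parameter are hypothetical, and it assumes all four `FE_*` rounding macros are defined on the platform.

```c
#include <fenv.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper: the value an overflowed double result is
   "corrected" to in stage 2, given the rounding mode and whether the
   stage 1 result is negative. */
double overflow_result(int mode, int negative) {
  int to_infinity;
  switch (mode) {
  case FE_TOWARDZERO:
    to_infinity = 0;          /* always toward zero: stay finite */
    break;
  case FE_DOWNWARD:
    to_infinity = negative;   /* away from zero only for negative results */
    break;
  case FE_UPWARD:
    to_infinity = !negative;  /* away from zero only for positive results */
    break;
  default:                    /* FE_TONEAREST: overflow always gives infinity */
    to_infinity = 1;
    break;
  }
  double magnitude = to_infinity ? INFINITY : DBL_MAX;
  return negative ? -magnitude : magnitude;
}
```

For the case in the question, `overflow_result(FE_DOWNWARD, 0)` and `overflow_result(FE_TOWARDZERO, 0)` both give `DBL_MAX`, which matches the sums that were observed.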

Also note that when the result of stage 1 is a value outside the representable range, the overflow exception is still raised even when the result of stage 2 is DBL_MAX. The overflow doesn't indicate "I made an infinity", but "I had to do stage 2".

Answer 2

@Sneftel's idea to use a two-stage rounding model was useful: it explains well why my expected result was amiss and the code was right.

Below, for reference, is augmented code that reports the exception flags and helps illustrate the model's application.

#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>  /* strcat() */

#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)

const char* exceptstr(fexcept_t *flag) {
  static char buf[100];
  buf[0] = 0;
  if (*flag & FE_DIVBYZERO)
    strcat(buf, STRINGIFY(FE_DIVBYZERO));
  if (*flag & FE_INEXACT)
    strcat(buf, STRINGIFY(FE_INEXACT));
  if (*flag & FE_INVALID)
    strcat(buf, STRINGIFY(FE_INVALID));
  if (*flag & FE_OVERFLOW)
    strcat(buf, STRINGIFY(FE_OVERFLOW));
  if (*flag & FE_UNDERFLOW)
    strcat(buf, STRINGIFY(FE_UNDERFLOW));
  return buf;
}

int main() {
  const double max = DBL_MAX;
  const double max_ulp = max - nextafter(max, 0);

  int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
  const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
  STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
  int n = sizeof mode / sizeof mode[0];
  int p = (DBL_MANT_DIG + 2) / 4;
  int P = (LDBL_MANT_DIG + 2) / 4;
  printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);

  printf("LDBL_MAX   :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG,
  LDBL_MAX);
  printf("DBL_MAX    :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
  printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
  printf("\n");
  printf(
      "mode:               Addendum                          SUM (double)                 SUM (long double)          FE\n");
  for (int i = 0; i < n; i++) {
    if (fesetround(mode[i])) {
      perror("Invalid mode");
      return -1;
    }
    double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
    nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
    max_ulp, nextafter(max_ulp, INFINITY)};
    const char *deltas[] = {"0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
        "ulp-", "ulp", "ulp+"};
    int dn = sizeof delta / sizeof delta[0];
    for (int d = 0; d < dn; d++) {
      if (feclearexcept(FE_ALL_EXCEPT)) {
        perror("feclearexcept()");
        return -1;
      }
      ///////////////////////////////
      double sum = max + delta[d];
      ///////////////////////////////
      fexcept_t flag;
      if (fegetexceptflag(&flag, FE_ALL_EXCEPT)) {
        perror("fegetexceptflag()");
        return -1;
      }

      printf(
          "mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La %s\n", //
          modes[i], deltas[d], p, delta[d], p, sum, P, 1.0L * max + delta[d],
          exceptstr(&flag));
    }
    puts("");
  }
}

Output (Look to the far right)

FLT_EVAL_METHOD:0
LDBL_MAX   :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX    :0x1.fffffffffffffp+1023  1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971   1.9958403095347198e+292

mode:               Addendum                          SUM (double)                 SUM (long double)          FE
mode:FE_DOWNWARD    0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD    0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD    ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_DOWNWARD    ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_DOWNWARD    ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW

mode:FE_TONEAREST   0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST   ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW

mode:FE_TOWARDZERO  0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO  0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO  ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_TOWARDZERO  ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO  ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW

mode:FE_UPWARD      0.5 ulp-:0x1.fffffffffffffp+969   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff802p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD      ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000002p+1024 FE_INEXACTFE_OVERFLOW
