Why does the conversion (unsigned long long)DBL_MAX (or FLT_MAX) raise FE_INEXACT as well?

Question

Code (t1.c):

#include <stdio.h>
#include <float.h>
#include <fenv.h>

#if _MSC_VER
#pragma fenv_access (on)
#else
#pragma STDC FENV_ACCESS ON
#endif


void print_fpe(void)
{
    int fpe = fetestexcept(FE_ALL_EXCEPT);
    printf("current exceptions raised:");
    if (fpe & FE_DIVBYZERO)       printf(" FE_DIVBYZERO");
    if (fpe & FE_INEXACT)         printf(" FE_INEXACT");
    if (fpe & FE_INVALID)         printf(" FE_INVALID");
    if (fpe & FE_OVERFLOW)        printf(" FE_OVERFLOW");
    if (fpe & FE_UNDERFLOW)       printf(" FE_UNDERFLOW");
    if ((fpe & FE_ALL_EXCEPT)==0) printf(" none");
}

volatile double d = DBL_MAX;
volatile float f = FLT_MAX;
volatile signed long long ll;
volatile signed long l;
volatile signed int i;
volatile signed short s;
volatile signed char c;
volatile unsigned long long ull;
volatile unsigned long ul;
volatile unsigned int ui;
volatile unsigned short us;
volatile unsigned char uc;

#define TEST(dst, type, src)         \
    feclearexcept(FE_ALL_EXCEPT);    \
    dst = (type)(src);               \
    print_fpe();                     \
    printf(" line %d\n", __LINE__);

int main(void)
{
    TEST(ll, signed long long, d);
    TEST(l, signed long, d);
    TEST(i, signed int, d);
    TEST(s, signed short, d);
    TEST(c, signed char, d);
    TEST(ll, signed long long, f);
    TEST(l, signed long, f);
    TEST(i, signed int, f);
    TEST(s, signed short, f);
    TEST(c, signed char, f);
    TEST(ull, unsigned long long, d); // line 55
    TEST(ul, unsigned long, d);
    TEST(ui, unsigned int, d);
    TEST(us, unsigned short, d);
    TEST(uc, unsigned char, d);
    TEST(ull, unsigned long long, f); // line 60
    TEST(ul, unsigned long, f);
    TEST(ui, unsigned int, f);
    TEST(us, unsigned short, f);
    TEST(uc, unsigned char, f);
    return 0;
}

Invocations and results:

$ cl t1.c && t1
current exceptions raised: FE_INVALID line 45
current exceptions raised: FE_INVALID line 46
current exceptions raised: FE_INVALID line 47
current exceptions raised: FE_INVALID line 48
current exceptions raised: FE_INVALID line 49
current exceptions raised: FE_INVALID line 50
current exceptions raised: FE_INVALID line 51
current exceptions raised: FE_INVALID line 52
current exceptions raised: FE_INVALID line 53
current exceptions raised: FE_INVALID line 54
current exceptions raised: FE_INEXACT FE_INVALID line 55
current exceptions raised: FE_INVALID line 56
current exceptions raised: FE_INVALID line 57
current exceptions raised: FE_INVALID line 58
current exceptions raised: FE_INVALID line 59
current exceptions raised: FE_INEXACT FE_INVALID line 60
current exceptions raised: FE_INVALID line 61
current exceptions raised: FE_INVALID line 62
current exceptions raised: FE_INVALID line 63
current exceptions raised: FE_INVALID line 64

$ clang t1.c && ./a.exe
t1.c:8:14: warning: pragma STDC FENV_ACCESS ON is not supported, ignoring pragma [-Wunknown-pragmas]
#pragma STDC FENV_ACCESS ON
             ^
1 warning generated.
current exceptions raised: FE_INVALID line 45
current exceptions raised: FE_INVALID line 46
current exceptions raised: FE_INVALID line 47
current exceptions raised: FE_INVALID line 48
current exceptions raised: FE_INVALID line 49
current exceptions raised: FE_INVALID line 50
current exceptions raised: FE_INVALID line 51
current exceptions raised: FE_INVALID line 52
current exceptions raised: FE_INVALID line 53
current exceptions raised: FE_INVALID line 54
current exceptions raised: FE_INEXACT FE_INVALID line 55
current exceptions raised: FE_INEXACT FE_INVALID line 56
current exceptions raised: FE_INVALID line 57
current exceptions raised: FE_INVALID line 58
current exceptions raised: FE_INVALID line 59
current exceptions raised: FE_INEXACT FE_INVALID line 60
current exceptions raised: FE_INEXACT FE_INVALID line 61
current exceptions raised: FE_INVALID line 62
current exceptions raised: FE_INVALID line 63
current exceptions raised: FE_INVALID line 64

$ gcc t1.c && ./a.exe
current exceptions raised: FE_INVALID line 45
current exceptions raised: FE_INVALID line 46
current exceptions raised: FE_INVALID line 47
current exceptions raised: FE_INVALID line 48
current exceptions raised: FE_INVALID line 49
current exceptions raised: FE_INVALID line 50
current exceptions raised: FE_INVALID line 51
current exceptions raised: FE_INVALID line 52
current exceptions raised: FE_INVALID line 53
current exceptions raised: FE_INVALID line 54
current exceptions raised: FE_INEXACT FE_INVALID line 55
current exceptions raised: FE_INEXACT FE_INVALID line 56
current exceptions raised: FE_INVALID line 57
current exceptions raised: FE_INVALID line 58
current exceptions raised: FE_INVALID line 59
current exceptions raised: FE_INEXACT FE_INVALID line 60
current exceptions raised: FE_INEXACT FE_INVALID line 61
current exceptions raised: FE_INVALID line 62
current exceptions raised: FE_INVALID line 63
current exceptions raised: FE_INVALID line 64

Question: why does the conversion (unsigned long long)DBL_MAX (or FLT_MAX) raise FE_INEXACT in addition to FE_INVALID?

Answer 1

I suppose you're testing this on x86, since that's where I see the behavior you describe. Here's the low-level explanation.

On x86-64, gcc at least does most floating-point to integer conversions with the cvttsd2si instruction, which converts a double-precision floating-point number to a 32- or 64-bit signed integer and raises an "invalid" exception if the result is out of range. This instruction can be used to convert to any signed integer type, and also to unsigned integer types of 32 bits or narrower. For instance, a conversion to unsigned 32-bit can be done by converting to signed 64-bit and discarding the high bits.

But this does not work for conversion to unsigned 64-bit, since the input might be a number that doesn't fit in signed 64-bit but would fit in unsigned 64-bit, and x86 has no instruction to make that conversion directly. As such, some extra arithmetic is needed, and it's these additional instructions that produce the "inexact" exception. (Specifically, it does a subsd to subtract (double)LLONG_MAX, that is 2^63, from the input, which does indeed result in a loss of precision when the input is DBL_MAX.)

See "Unsigned 64-bit to double conversion: why this algorithm from g++" for an example of the sorts of gymnastics that gcc does to do this as efficiently as possible.

Note that on x86-64 platforms where unsigned long is 64 bits wide, you actually see FE_INEXACT with conversion to unsigned long as well, since it has the same width as unsigned long long there. I get the exact behavior you observe on x86-32, where unsigned long long is the only 64-bit type to which this applies. The code in that case is a bit more complicated, and I leave it to you to read through the assembly if you are really interested.

By contrast, when I run this code on AArch64, all lines simply give FE_INVALID. That's because AArch64 does have a dedicated instruction to convert floating point to unsigned 64-bit (fcvtzu), and so there's no further arithmetic that could involve an inexact result.

Answer 2

The code (unsigned long long)DBL_MAX has undefined behaviour, as per C11 6.3.1.4:

When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.

Since the behaviour is undefined, "anything can happen", i.e. the behaviour is not covered by the standard.

solution1
2 ACCEPTED 2021-03-01 23:07:00

solution2
1 2021-03-01 22:48:56
