
When is a floating point operation 'invalid'?

Consider the following, compiled with both MSVC 15 and 16 (i.e. Visual Studio 2017 and 2019) and run on a Xeon 15something:

#include <float.h>   // _status87 (MSVC-specific)
#include <stdint.h>  // uint8_t

int main()
{
    unsigned int x;
    uint8_t val;
    float f;

    x = _status87();    // x = 0 here, OK
    f = -1.00e+9;
    x = _status87();    // x = 0 here, OK
    val = uint8_t(f);   // val = 0 here, I can live with that
    x = _status87();    // x = 0 here, OK
    f = -1.00e+10;
    val = uint8_t(f);   // val = 0 here, I can live with that
    x = _status87();    // x = 16 = _EM_INVALID, wtf?
}

It's obvious that some casts give the 'wrong' result, i.e. when you want to store a number that is larger than what fits in a variable of a certain type, there is no way to store that value. My question is: why is the floating point status flag set to 'invalid'? Overflow, underflow and/or inexact I could live with, but why 'invalid'? I can't find any definition anywhere of what specific CPUs consider 'invalid' floating point operations. I also can't find out why the flag is not set with an exponent of 9 (-1.00e+9, despite the value not fitting and the cast result being 0), but is set with an exponent of 10 (-1.00e+10). It seems to me that no relevant maximum/minimum is being passed at that threshold.

More importantly (to me), is there a way for me to cast in a way so that the floating point register isn't touched, ever? The reason being that the code I'm working on relies (later on) on the register not being in an 'invalid' state, and I can't reasonably or reliably modify each use of that register flag check. But also just resetting the flag is error-prone (because of assumptions elsewhere, 'elsewhere' being code I can't touch). I've been looking at boost::numeric_cast but that doesn't seem to help any here, unless I'm missing something somewhere?

But in general, any help on how 'invalid' floating point operations work would be helpful.

In the generated assembly, we can see that the conversion uses the cvttss2si instruction. The documentation for this instruction reads:

Converts a single-precision floating-point value in the source operand (the second operand) to a signed double-word integer (or signed quadword integer if operand size is 64 bits) in the destination operand (the first operand).

Since the register used there is eax, the double-word case applies here. The documentation continues:

If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised.

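In your case, -1e9 can be stored in a signed double word (range -2,147,483,648 to 2,147,483,647), but -1e10 cannot. That is the threshold you are seeing: the conversion goes through a 32-bit signed integer first, not directly to uint8_t. The exception is then seemingly just reflected in the status word read by the _status87() function.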
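To make the threshold concrete, here is a minimal sketch that checks the two values from the question against the signed 32-bit range. It also shows why val ends up as 0 in both cases, under two assumptions that are not confirmed by the question: that the masked 'invalid' case yields the integer 0x80000000, and that the compiler then keeps only the low byte of the 32-bit result. The names fits_in_int32, low_byte and check_question_values are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if f lies within the signed 32-bit range, i.e. a float -> int32
// conversion of f would not be out of range.
bool fits_in_int32(float f) {
    const double d = f;  // float -> double is exact
    return d >= static_cast<double>(std::numeric_limits<std::int32_t>::min()) &&
           d <= static_cast<double>(std::numeric_limits<std::int32_t>::max());
}

// Low byte kept when a 32-bit integer is narrowed to uint8_t.
// This narrowing is well-defined: the value is reduced modulo 256.
std::uint8_t low_byte(std::int32_t v) {
    return static_cast<std::uint8_t>(v);
}

// Self-check using the values from the question.
void check_question_values() {
    assert(fits_in_int32(-1.00e+9f));    // in range: no 'invalid' expected
    assert(!fits_in_int32(-1.00e+10f));  // out of range: 'invalid' expected

    // -1000000000 is 0xC4653600 as a 32-bit pattern, low byte 0x00.
    assert(low_byte(-1000000000) == 0);
    // Assumed result of the out-of-range conversion: 0x80000000, low byte 0x00.
    assert(low_byte(std::numeric_limits<std::int32_t>::min()) == 0);
}
```

Both paths produce a low byte of 0, which would explain why the two casts look identical in the debugger even though only the second one sets the flag.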


Note that, according to the C++ Standard (conv.fpint/1), the behavior here is undefined:

A prvalue of a floating-point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.

This holds for both values of f, since neither -1e9 nor -1e10 can be represented in a uint8_t.
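One way to sidestep both the undefined behavior and the 'invalid' flag is to range-check the value before casting, so that an out-of-range conversion is never executed. The following is only a sketch: to_uint8_clamped and check_no_invalid_flag are hypothetical helpers, and clamping to 0..255 and mapping NaN to 0 are assumed policies, not something the question asks for. The flag is queried through the portable <cfenv> interface rather than _status87, and compilers are not strictly required to preserve flag state without the relevant floating-point pragmas or options.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>
#include <cstdint>

// Convert a float to uint8_t without ever performing an out-of-range
// float -> integer conversion.
std::uint8_t to_uint8_clamped(float f) {
    // Test for NaN first: ordered comparisons such as <= on a NaN can
    // themselves raise 'invalid'.
    if (std::isnan(f)) return 0;    // assumed policy: NaN maps to 0
    if (f <= 0.0f) return 0;        // clamp negatives (and zero) to 0
    if (f >= 255.0f) return 255;    // clamp large values to 255
    // Here 0 < f < 255, so the truncated value is always representable.
    return static_cast<std::uint8_t>(f);
}

// Self-check: the inputs from the question (plus one large value) convert
// without setting FE_INVALID. Clearing the flag is only done for this check.
void check_no_invalid_flag() {
    std::feclearexcept(FE_INVALID);
    const float inputs[] = {-1.00e+9f, -1.00e+10f, 300.0f};
    for (float f : inputs) {
        (void)to_uint8_clamped(f);
    }
    assert(std::fetestexcept(FE_INVALID) == 0);
}
```

This only avoids 'invalid'. The 'inexact' flag may still be set when a fractional value such as 42.9f is truncated, so it does not leave the status word completely untouched.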
