简体   繁体   中英

Floating point rounding when truncating

This is probably a question for an x86 FPU expert:

I am trying to write a function which generates a random floating point value in the range [min,max]. The problem is that my generator algorithm (the floating-point Mersenne Twister, if you're curious) only returns values in the range [1,2) - ie, I want an inclusive upper bound, but my "source" generated value is from an exclusive upper bound. The catch here is that the underlying generator returns an 8-byte double, but I only want a 4-byte float, and I am using the default FPU rounding mode of Nearest.

What I want to know is whether the truncation itself in this case will result in my return value being inclusive of max when the FPU internal 80-bit value is sufficiently close, or whether I should increment the significand of my max value before multiplying it by the intermediary random in [1,2), or whether I should change FPU modes. Or any other ideas, of course.

Here's the code I am currently using, and I did verify that 1.0f resolves to 0x3f800000:

float MersenneFloat( float min, float max )
{
    //genrand returns a double in [1,2)
    const float random = (float)genrand_close1_open2(); 
    //return in desired range
    return min + ( random - 1.0f ) * (max - min);
}

If it makes a difference, this needs to work on both Win32 MSVC++ and Linux gcc. Also, will using any versions of the SSE optimizations change the answer to this?

Edit: The answer is yes, truncation in this case from double to float is sufficient to cause the result to be inclusive of max. See Crashworks' answer for more.

The SSE ops will subtly change the behavior of this algorithm because they don't have the intermediate 80-bit representation -- the math truly is done in 32 or 64 bits. The good news is that you can easily test it and see if it changes your results by simply specifying the /ARCH:SSE2 command line option to MSVC, which will cause it to use the SSE scalar ops instead of x87 FPU instructions for ordinary floating point math.

I'm not sure offhand of what the exact rounding behavior is around the integer boundaries, but you can test to see what'll happen when 1.999.. gets rounded from 64 to 32 bits by eg

static uint64 OnePointNineRepeating = 0x3FF FFFFF FFFF FFFF // exponent 0 (biased to 1023), all 1 bits in mantissa
double asDouble = *(double *)(&OnePointNineRepeating);
float asFloat = asDouble;
return asFloat;

Edit, result: original poster ran this test and found that with truncation, the 1.99999 will round up to 2 both with and without /arch:SSE2.

如果你确实调整了舍入以确保包含范围的两端,那么这些极端值是不是只有非极端值的一半?

With truncation, you are never going to be inclusive of the max.

Are you sure you really need the max? There is literally an almost 0 chance that you will land on exactly the maximum.

That said, you can exploit the fact that you are giving up precision and do something like this:

float MersenneFloat( float min, float max )
{
    double random = 100000.0; // just a dummy value
    while ((float)random > 65535.0)
    {
        //genrand returns a double in [1,2)
        double random = genrand_close1_open2() - 1.0; // now it's [0,1)
        random *= 65536.0; // now it's [0,65536). We try again if it's > 65535.0
    }
    //return in desired range
    return min + float(random/65535.0) * (max - min);
}

Note that, now, it has a slight chance of multiple calls to genrand each time you call MersenneFloat. So you have given up possible performance for a closed interval. Since you are downcasting from double to float, you end up sacrificing no precision.

Edit: improved algorithm

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM