
Floating point rounding when truncating

This is probably a question for an x86 FPU expert:

I am trying to write a function which generates a random floating point value in the range [min,max]. The problem is that my generator algorithm (the floating-point Mersenne Twister, if you're curious) only returns values in the range [1,2), i.e., I want an inclusive upper bound, but my "source" generated value has an exclusive upper bound. The catch here is that the underlying generator returns an 8-byte double, but I only want a 4-byte float, and I am using the default FPU rounding mode of Nearest.

What I want to know is whether the truncation itself in this case will result in my return value being inclusive of max when the FPU internal 80-bit value is sufficiently close, or whether I should increment the significand of my max value before multiplying it by the intermediary random in [1,2), or whether I should change FPU modes. Or any other ideas, of course.

Here's the code I am currently using, and I did verify that 1.0f resolves to 0x3f800000:

float MersenneFloat( float min, float max )
{
    //genrand returns a double in [1,2)
    const float random = (float)genrand_close1_open2(); 
    //return in desired range
    return min + ( random - 1.0f ) * (max - min);
}

If it makes a difference, this needs to work on both Win32 MSVC++ and Linux gcc. Also, will using any versions of the SSE optimizations change the answer to this?

Edit: The answer is yes, truncation in this case from double to float is sufficient to cause the result to be inclusive of max. See Crashworks' answer for more.

The SSE ops will subtly change the behavior of this algorithm because they don't have the intermediate 80-bit representation; the math truly is done in 32 or 64 bits. The good news is that you can easily test it and see if it changes your results by simply specifying the /ARCH:SSE2 command line option to MSVC, which will cause it to use the SSE scalar ops instead of x87 FPU instructions for ordinary floating point math.

I'm not sure offhand what the exact rounding behavior is around the integer boundaries, but you can test to see what'll happen when 1.999.. gets rounded from 64 to 32 bits by e.g.

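// needs <stdint.h> and <string.h>
static const uint64_t OnePointNineRepeating = 0x3FFFFFFFFFFFFFFFULL; // exponent 0 (biased to 1023), all 1 bits in mantissa
double asDouble;
memcpy(&asDouble, &OnePointNineRepeating, sizeof asDouble); // reinterpret the bits without breaking aliasing rules
float asFloat = (float)asDouble; // the narrowing conversion uses the current rounding mode
return asFloat;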

Edit, result: the original poster ran this test and found that with truncation (the double-to-float conversion), the 1.99999... will round up to 2 both with and without /arch:SSE2.

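If you do adjust the rounding to make sure both ends of the range are included, won't those extreme values be only half as likely as any of the non-extreme values?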

With truncation, you are never going to be inclusive of the max.

Are you sure you really need the max? There is almost zero chance that you will land on exactly the maximum.

That said, you can exploit the fact that you are giving up precision and do something like this:

float MersenneFloat( float min, float max )
{
    double random = 100000.0; // just a dummy value to enter the loop
    while ((float)random > 65535.0)
    {
        //genrand returns a double in [1,2)
        random = genrand_close1_open2() - 1.0; // now it's [0,1); assign to the outer variable, do not redeclare it
        random *= 65536.0; // now it's [0,65536). We try again if it's > 65535.0
    }
    //return in desired range
    return min + float(random/65535.0) * (max - min);
}

Note that there is now a slight chance of multiple calls to genrand each time you call MersenneFloat. So you have given up some possible performance in exchange for a closed interval. Since you are downcasting from double to float, you end up sacrificing no precision.

Edit: improved algorithm
