Maximum absolute and relative error of the IEEE-754 single-precision representation?

Question

I'm trying to find the maximum overall absolute and relative error of the IEEE-754 single-precision representation. Sign: 1 bit, exponent: 8 bits, significand: 23 bits.

I understand that, when normalised, the significand holds at most 23 digits (assuming a sign bit and an 8-bit exponent). So if any extra digits turned up, the error would come from the positions from 2^-24 onwards, i.e. 2^-24, 2^-25, 2^-26, ... I summed this infinite geometric series to find the error and got 2^-23. However, I'm unsure whether this is correct for the relative error. The relative error would be ((true value - given value) / true value) * 100. I'm not sure whether this is the wrong approach.

Additionally, I'm confused about how to find the absolute error. Could anyone help, please? Thanks in advance.

Answer 1

All finite IEEE-754 single-precision values are exact. There is no error in the value itself.

A calculation or conversion may incur an error, as there are only about 2^32 different IEEE-754 single-precision values and infinitely many possible calculation results. Typically, a nearby single-precision value is selected when the true result is not encodable.
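As a small illustration of that conversion error, the sketch below rounds a Python float (a double) to single precision and back. The helper name `to_f32` is made up for this example, and it assumes the platform's `struct` module uses IEEE-754 binary32 for the `"f"` format, which is the usual case.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (double) to single precision, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

x = 0.1                  # the double nearest to 0.1; not encodable in single precision
y = to_f32(x)            # the nearby single-precision value that gets selected

print(y)                 # the stored single-precision value, shown as a double
print(abs(y - x))        # conversion error relative to the double x
print(to_f32(y) == y)    # once encoded, the value itself is exact
print(to_f32(0.5))       # 0.5 is encodable, so no error is incurred
```

The error here comes from the conversion step, not from the stored value.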

If we limit the discussion to calculation results that lie between a pair of finite single-precision values, then the error can be at most 1.0 ULP (unit in the last place) ^*1.

Note: the finite range is +/-3.4028235... × 10^38, i.e. +/-FLT_MAX.

Within that range, the absolute difference between the true result and the encoded single-precision value is then at most FLT_MAX - next_smallest_float(FLT_MAX). This is close to FLT_MAX * pow(2,-24) (about 2.03 * 10^31). Single precision has a 24-bit significand (23 bits explicitly encoded, 1 implied).
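The gap at the top of the range can be checked directly from the bit patterns. This is a minimal sketch: `bits_to_f32` is a helper name invented here, and the two hexadecimal patterns are the standard encodings of the largest finite single and its lower neighbour.

```python
import struct

def bits_to_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

flt_max = bits_to_f32(0x7F7FFFFF)      # largest finite single (FLT_MAX)
below_max = bits_to_f32(0x7F7FFFFE)    # next smaller single

gap = flt_max - below_max              # exact when computed in double precision
print(gap == 2.0 ** 104)               # the 1.0 ULP spacing at the top of the range
print(gap)                             # about 2.03e31
print(flt_max * 2.0 ** -24)            # close to, and slightly below, the gap
```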

Outside that range, the absolute error can be infinite.

For many calculations, when the results are in the normal single-precision range, the result is within 1.0 ULP of the correct answer ^*1. For transcendental functions like sine, the error is typically within 2.0 ULP of the correct answer. It can be much worse for weak implementations.
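For plain conversion with round-to-nearest, the 0.5 ULP case translates into a relative error of at most 2^-24 in the normal range. The sketch below checks that on a handful of arbitrarily chosen sample values; `to_f32` and `HALF_ULP_REL` are names made up for this example, and it assumes `struct` rounds to nearest when packing a double as `"f"`.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (double) to single precision, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

HALF_ULP_REL = 2.0 ** -24   # 0.5 ULP relative to a 24-bit significand

# Arbitrary samples, all inside the normal single-precision range.
samples = [0.1, 1.0 / 3.0, 3.141592653589793, 1e-30, 6.02214076e23, 1e38]

worst = max(abs(to_f32(x) - x) / abs(x) for x in samples)
print(worst)                   # largest relative conversion error seen
print(worst <= HALF_ULP_REL)   # the bound holds for these samples
```

A few samples do not prove the bound; they only show what it looks like in practice.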

When the true result is small and the single-precision value is a non-zero subnormal, the relative error grows as the true value nears 0.0, up to 0.5 * pow(2,0), or 1/2. Note that this takes the relative error to be:

relative_error_IEEE = |true value - IEEE value|/IEEE value

When the IEEE value is zero, or when the relative error is instead defined as below, the relative error approaches infinity.

relative_error_true = |true value - IEEE value|/true value
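To make the two definitions concrete near zero, here is a sketch under round-to-nearest. The true value is picked (as an assumption for illustration) just above the midpoint between 0 and the smallest subnormal 2^-149, so it rounds up; `to_f32` is a helper name invented for this example.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (double) to single precision, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

MIN_SUBNORMAL = 2.0 ** -149                    # smallest positive single

true_value = 2.0 ** -150 * (1 + 2.0 ** -20)    # just above the halfway point
ieee_value = to_f32(true_value)                # rounds up to MIN_SUBNORMAL
err = abs(true_value - ieee_value)

print(ieee_value == MIN_SUBNORMAL)
print(err / ieee_value)    # relative_error_IEEE: just under 1/2
print(err / true_value)    # relative_error_true: just under 1

# An exact tie at 2**-150 goes to the even neighbour, which is zero,
# so relative_error_IEEE would divide by zero there.
print(to_f32(2.0 ** -150))
```

With round-to-nearest the second definition stays below 1 in this example; it is the looser 1.0 ULP bound that lets it grow without limit as the true value shrinks.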

^*1 Common calculations like +, -, *, / should be within 0.5 ULP when the rounding mode is round-to-nearest.

Answer 2

The largest absolute error is 10141204801825835211973625643008 (2^103, about 1.01 × 10^31), and the largest relative error is 0.5:

>>> (2**(0xfe-150)* 0xffffff - 2**(0xfe-150)* 0xfffffe)/2 
10141204801825835211973625643008L
>>> 2**(0xfe-150)* 0xffffff
340282346638528859811704183484516925440L
>>> print ("%100.100f\n" % (10141204801825835211973625643007.0/340282346638528859811704183484516925440.0))
0.0000000298023241640522577793688714653530524856250849552452564239501953125000000000000000000000000000

>>> print("%151.151f\n" % ( ( 2**(0x0-150)* 0x000002 - 2**(0x0-150)* 0x000001 )/2 ))
0.0000000000000000000000000000000000000000000003503246160812042677309323958224790328200654854691289429392670709724477706714651503716595470905303955078125

>>> print("%151.151f\n" % (2**(0x0-150)* 0x00001))
0.0000000000000000000000000000000000000000000007006492321624085354618647916449580656401309709382578858785341419448955413429303007433190941810607910156250

>>> 3503246160812042677309323958224790328200654854691289429392670709724477706714651503716595470905303955078125.0/7006492321624085354618647916449580656401309709382578858785341419448955413429303007433190941810607910156250
0.5
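The session above looks like Python 2 (note the `L` suffix on the integers). A rough Python 3 restatement with exact rational arithmetic might look like the sketch below. It takes the smallest subnormal to be 2^-149, whereas the session scales by 2^-150; the resulting ratio of 0.5 is the same either way. All variable names are chosen here for illustration.

```python
from fractions import Fraction

# Largest finite single and its lower neighbour, as exact integers.
flt_max = (2 ** 24 - 1) * 2 ** 104
below_max = (2 ** 24 - 2) * 2 ** 104

half_gap = (flt_max - below_max) // 2        # worst case for round-to-nearest
print(half_gap == 2 ** 103)
print(float(Fraction(half_gap, flt_max)))    # relative size, close to 2**-25

# Smallest positive subnormal and half the spacing between subnormals.
min_subnormal = Fraction(1, 2 ** 149)
half_spacing = min_subnormal / 2
print(half_spacing / min_subnormal)          # Fraction(1, 2)
```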

2 answers

Answer 1
1 2023-01-06 04:30:57

Answer 2
-2 2023-01-06 10:42:50
