
IEEE 754 and machine numbers

I've been trying to wrap my head around machine numbers like the unit roundoff (u) and epsilon (e) in combination with the IEEE 754 standard. My textbook states some things that don't really make sense to me.

Unit roundoff according to my textbook is:

  • for single precision (23-bit mantissa): u = 6e-8
  • for double precision (52-bit mantissa): u = 2e-16

I've been trying to derive a formula for these results with two relations:

  • my textbook states: "In binary arithmetic with rounding we usually have e = 2*u"
  • e = 2^-n, n being the number of mantissa bits

Combining these gives u = 2^-(n+1), again with n being the number of mantissa bits. Checking this formula against the given values of u for the two precisions:

for single: u = 2^-(23+1) = 5.96e-8, which checks out.
for double: u = 2^-(52+1) = 1.11e-16, which doesn't check out.
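The two checks can be reproduced numerically, e.g. in Python:

```python
# Candidate formula u = 2^-(n+1), checked against the textbook's values.
u_single = 2.0 ** -(23 + 1)
u_double = 2.0 ** -(52 + 1)

print(u_single)   # ≈ 5.96e-8, matches the stated 6e-8
print(u_double)   # ≈ 1.11e-16, does not match the stated 2e-16
```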

Could someone please help me derive a correct formula for the unit roundoff, or point out the mistakes I have been making? All help is appreciated.

This appears to be an error in your textbook.

The significands of the IEEE-754 basic 32- and 64-bit binary floating-point formats are 24 and 53 bits, respectively.¹ It is sometimes stated that the significands are 23 bits and 52 bits, but this is a mistake. Those are the sizes of the main fields for encoding the significands: the full 24-bit significand is encoded with 23 bits in the main significand field and 1 bit in the exponent field. Similarly, the full 53-bit significand is encoded with 52 bits in the main significand field and 1 bit in the exponent field. (The leading bit of the full significand comes from the exponent field: if the exponent field is zero, the leading significand bit is 0. If the exponent field is neither zero nor all ones, the leading significand bit is 1. If the exponent field is all ones, the floating-point object is a special value, either an infinity or a NaN.)
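As a sketch of how that encoding can be inspected, here is a small Python helper (written for this answer, not a library function) that unpacks the three fields of a binary32 value and reconstructs the implicit leading significand bit:

```python
import struct

def decode_float32(x):
    # Re-interpret x as an IEEE-754 binary32 bit pattern.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8-bit exponent field
    fraction = bits & 0x7FFFFF       # 23-bit main significand field
    # The 24th (leading) significand bit is not stored: it is 1 for
    # normal numbers (exponent field neither all zeros nor all ones)
    # and 0 for zeros and subnormals (exponent field all zeros).
    leading = 1 if 0 < exponent < 0xFF else 0
    return sign, exponent, leading, fraction

print(decode_float32(1.0))   # (0, 127, 1, 0): significand is 1.000...0
```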

When the leading bit of the 24-bit significand represents the value 1, the least significant bit represents the value 2^-23. That is the so-called epsilon. When a real number is being rounded to the nearest representable floating-point value, the maximum error is half the value of the least significant bit. (Because, if it were more than half the distance between two numbers, we would choose the number in the other direction, since it is closer.)
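That epsilon can be checked directly: stepping 1.0 up by one unit in the last place of binary32 changes it by exactly 2^-23. A quick sketch using Python's struct module:

```python
import struct

# Bit pattern of 1.0 as binary32, then the very next pattern up.
one_bits = struct.unpack(">I", struct.pack(">f", 1.0))[0]
next_up = struct.unpack(">f", struct.pack(">I", one_bits + 1))[0]

print(next_up - 1.0 == 2.0 ** -23)   # True: the gap at 1.0 is 2^-23
```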

For a 53-bit significand, the least significant bit represents the value 2^-52 relative to the leading bit, and the maximum error when rounding to nearest is half that. So, for a leading bit of 1, the maximum rounding error should be 2^-53, which is about 1.11e-16. If your book says it is 2e-16, it is incorrect.
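In Python, whose float is binary64 on virtually all platforms, this can be confirmed with sys.float_info:

```python
import sys

eps = sys.float_info.epsilon   # gap between 1.0 and the next double: 2^-52
u = eps / 2                    # max round-to-nearest error at 1.0: 2^-53

print(eps == 2.0 ** -52)   # True
print(u == 2.0 ** -53)     # True
print(u)                   # ≈ 1.11e-16, not 2e-16
```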

Footnote

¹ “Significand” is the preferred term. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear. Mantissas are logarithmic.
