
Does IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2?

It's all in the title: do IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2?

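It guarantees exact representation of all integers, up to the point where the number of significant binary digits exceeds the width of the mantissa.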

A simple way to get the answer for any decimal number: convert its absolute value to binary (24 bits for float, 53 bits for double, 113 bits for quad), then convert back to decimal and see whether you get the same value.

For integers the answer is obvious: you don't lose anything unless the value needs more significant bits than the given width provides.
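The round-trip check described above can be sketched in Python. This is only an illustration, not part of the original answer: the helper name `to_binary32` is made up here, and the sketch assumes Python's `float` is an IEEE-754 binary64 value (true for CPython on common platforms) and that `struct`'s `"f"` format rounds to binary32.

```python
import math
import struct

def to_binary32(x: float) -> float:
    """Round x to binary32 (hypothetical helper) and widen it back to a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# The six values from the question, written as floats so that -0.0 keeps its sign.
values = [-2.0, -1.0, -0.0, 0.0, 1.0, 2.0]

for v in values:
    back = to_binary32(v)
    same_value = back == v
    # -0.0 == 0.0 compares equal, so the sign has to be checked separately.
    same_sign = math.copysign(1.0, back) == math.copysign(1.0, v)
    print(repr(v), "survives binary32 round trip:", same_value and same_sign)
```

Every value should report that it survives, since each needs at most two significant bits.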

Conversion of rational values with a non-integer part is more interesting. There you may lose precision when converting to binary with some fixed width, because many short decimal fractions (0.1, for example) have a periodic binary expansion that has to be cut off. When you convert back to decimal, you get a long decimal value that differs slightly from the one you started with (or you lose precision again if you round it).
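A minimal sketch of that effect, assuming Python's `float` is binary64: `fractions.Fraction` and `decimal.Decimal` both convert a float to its exact stored value, so they show what actually ends up in memory. The variable names are only for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# 0.5 is 1/2, a finite binary fraction, so it is stored exactly.
half = Fraction(0.5)
print(half == Fraction(1, 2))

# 0.1 is periodic in binary, so the stored value is only the nearest binary64 number.
tenth = Fraction(0.1)
print(tenth == Fraction(1, 10))

# The stored value is a ratio with a power-of-two denominator.
print(tenth.numerator, "/", tenth.denominator)

# Converting back to decimal yields a long but finite expansion, not 0.1.
print(Decimal(0.1))
```

The first comparison prints `True` and the second prints `False`.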


Since you're dabbling with IEEE floats, first read the Wikipedia page, then, when you feel you're ready for more, proceed to the first external link there, "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

IEEE 754 floating-point numbers can store integers within certain ranges exactly. For example:

  • binary32, implemented in C/C++ as float, provides 24 bits of precision and can therefore represent 16-bit integers exactly, e.g. short int;
  • binary64, implemented in C/C++ as double, provides 53 bits of precision and can represent 32-bit integers exactly, e.g. int;
  • the non-standard Intel 80-bit extended precision format, implemented as long double by some x86/x64 compilers, provides 64 significant bits and can represent 64-bit integers exactly, e.g. long int (on LP64 systems, e.g. Unix) or long long int (on LLP64 systems, e.g. Windows);
  • binary128, implemented as compiler-specific types such as __float128 (GCC) or _Quad (Intel C/C++), provides 113 bits of precision and can therefore represent 64-bit integers exactly.
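The limits in the list above can be probed with a short sketch. It assumes Python's `float` is binary64 and uses `struct`'s `"f"` format to emulate binary32; the helper names `is_exact_binary32` and `is_exact_binary64` are hypothetical. It is meant for moderately sized integers only, since very large ones raise `OverflowError` during conversion.

```python
import struct

def is_exact_binary64(n: int) -> bool:
    """True if the integer n converts to binary64 without rounding."""
    # Python compares int and float exactly, so this detects any rounding.
    return float(n) == n

def is_exact_binary32(n: int) -> bool:
    """True if the integer n converts to binary32 without rounding."""
    as_f32 = struct.unpack("<f", struct.pack("<f", float(n)))[0]
    return as_f32 == n

# Every 16-bit signed integer fits in the 24-bit significand of binary32.
print(all(is_exact_binary32(n) for n in range(-2**15, 2**15)))

# The first integer that binary32 cannot hold is 2**24 + 1.
print(is_exact_binary32(2**24), is_exact_binary32(2**24 + 1))

# The limits of 32-bit signed integers fit in binary64; 2**53 + 1 does not.
print(is_exact_binary64(-2**31), is_exact_binary64(2**31 - 1))
print(is_exact_binary64(2**53), is_exact_binary64(2**53 + 1))
```

Integers above the limit are not all lost: powers of two and other values with few significant bits are still exact, which is why the check looks at a specific value rather than its magnitude.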

The fact that double holds an extended range of integers, even surpassing the range of 32-bit integers, is used in JavaScript, which doesn't have a special integer type for ordinary numbers and instead uses double-precision floating point to represent integers.

One quirk of floating-point numbers is that they have a separate sign bit, so both positive and negative zero exist, which is not possible in the two's complement signed integer representation.
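To see the two zeros side by side, here is a small sketch, again assuming Python's `float` is binary64. The two values compare equal, but the sign bit is visible through `math.copysign` and in the raw bytes.

```python
import math
import struct

neg_zero = -0.0
pos_zero = 0.0

# The two zeros compare equal as numbers.
print(neg_zero == pos_zero)            # True

# copysign exposes the sign bit that == ignores.
print(math.copysign(1.0, neg_zero))    # -1.0
print(math.copysign(1.0, pos_zero))    # 1.0

# Big-endian byte dumps differ only in the leading sign bit.
neg_bits = struct.pack(">d", neg_zero).hex()
pos_bits = struct.pack(">d", pos_zero).hex()
print(neg_bits)                        # 8000000000000000
print(pos_bits)                        # 0000000000000000
```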

