
Does IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2?

It's all in the title: do IEEE-754 float, double and quad guarantee exact representation of -2, -1, -0, 0, 1, 2?

It guarantees exact representation of every integer until the number of significant binary digits exceeds the width of the mantissa.

A simple way to get the answer for any decimal number: convert the absolute value to binary (24 bits for float, 53 bits for double, 113 bits for quad), then back to decimal, and see whether you get the same value back.

For integers the answer is obvious: you don't lose anything unless the value is too big to fit into the given number of bits.
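The bit-counting argument above can be sketched in a few lines of Python. This is only an illustration: the helper name fits_in_significand and the PRECISION table are made up for this example, and the check deliberately ignores the exponent range, which is not a concern for integers this small.

```python
def fits_in_significand(n: int, precision_bits: int) -> bool:
    """Return True if integer n needs at most precision_bits significant bits.

    Hypothetical helper for illustration; it ignores exponent limits.
    """
    magnitude = abs(n)  # the sign is stored in a separate bit
    if magnitude == 0:
        return True
    # Trailing zero bits are absorbed by the exponent, so strip them first.
    while magnitude % 2 == 0:
        magnitude //= 2
    return magnitude.bit_length() <= precision_bits


# Significand widths quoted in the answer (including the implicit leading bit).
PRECISION = {"float": 24, "double": 53, "quad": 113}

for name, p in PRECISION.items():
    # -2, -1, 0, 1 and 2 fit in every format.
    assert all(fits_in_significand(n, p) for n in (-2, -1, 0, 1, 2))
    # 2**p still fits (one significant bit); 2**p + 1 is the first integer that does not.
    print(name, fits_in_significand(2**p, p), fits_in_significand(2**p + 1, p))
# float True False
# double True False
# quad True False
```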

Converting rational values with a non-integer part is more interesting. A value with a finite decimal expansion, such as 0.1, can have a periodic binary expansion, so you lose precision when rounding it to a fixed binary width. Converting the stored binary value back to decimal is exact but usually gives a long expansion, and you lose precision again if you round it.
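As a rough illustration of that round trip, the sketch below uses Python's fractions module to compare a rational value with what a double actually stores. It assumes Python's float is IEEE 754 binary64, which holds on mainstream platforms; the helper name is_exact_in_double is invented for this example.

```python
from decimal import Decimal
from fractions import Fraction


def is_exact_in_double(value: Fraction) -> bool:
    """Return True if value survives conversion to a double unchanged.

    Hypothetical helper; assumes Python's float is IEEE 754 binary64.
    """
    # float() rounds to the nearest double; Fraction() then recovers
    # the stored value exactly, so any rounding shows up as a mismatch.
    return Fraction(float(value)) == value


print(is_exact_in_double(Fraction(1, 2)))    # True: 0.5 is a power of two
print(is_exact_in_double(Fraction(-3, 8)))   # True: -0.375 = -(2**-2 + 2**-3)
print(is_exact_in_double(Fraction(1, 10)))   # False: 0.1 is periodic in binary

# The value actually stored for 0.1, converted back to decimal without rounding.
print(Decimal(0.1))
```

Values whose denominator is a power of two (and whose numerator fits in 53 bits) pass the check; everything else is rounded on the way in.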


Since you're dabbling with IEEE floats, first read the Wikipedia page, then, when you feel ready for more, proceed to the first external link there, "What Every Computer Scientist Should Know About Floating-Point Arithmetic".

IEEE 754 floating-point numbers can store integers of certain ranges exactly. For example:

  • binary32, implemented in C/C++ as float, provides 24 bits of precision and can therefore represent 16-bit integers, e.g. short int, exactly;
  • binary64, implemented in C/C++ as double, provides 53 bits of precision and can represent 32-bit integers, e.g. int, exactly;
  • the non-standard Intel 80-bit extended precision format, implemented as long double by some x86/x64 compilers, provides 64 significant bits and can represent 64-bit integers exactly, e.g. long int (on LP64 systems such as Unix) or long long int (on LLP64 systems such as Windows);
  • binary128, implemented as compiler-specific types such as __float128 (GCC) or _Quad (Intel C/C++), provides 113 bits in the mantissa and can therefore represent 64-bit integers exactly.
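The first two bullets can be checked empirically. The following is a minimal sketch, assuming that Python's struct format "f" packs an IEEE 754 binary32 value and that Python's float is binary64; the helper names through_binary32 and through_binary64 are hypothetical. The 80-bit and 128-bit formats are not covered because the standard library has no type for them.

```python
import struct


def through_binary32(n: int) -> int:
    """Round-trip an integer through a 4-byte binary32 value."""
    packed = struct.pack("<f", float(n))  # rounds to nearest binary32
    return int(struct.unpack("<f", packed)[0])


def through_binary64(n: int) -> int:
    """Round-trip an integer through Python's float (assumed binary64)."""
    return int(float(n))


# Every 16-bit signed integer survives binary32 unchanged.
assert all(through_binary32(n) == n for n in range(-2**15, 2**15))

# The extremes of the 32-bit signed range survive binary64 unchanged.
for n in (-2**31, -2, -1, 0, 1, 2, 2**31 - 1):
    assert through_binary64(n) == n

# Just past the significand width, odd integers start to collapse.
print(through_binary32(2**24), through_binary32(2**24 + 1))
# 16777216 16777216
print(through_binary64(2**53), through_binary64(2**53 + 1))
# 9007199254740992 9007199254740992
```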

The fact that double fits an extended range of integers, even surpassing the range of 32-bit integers, is used in JavaScript, whose Number type has no separate integer representation and instead uses double precision floating point to represent integers.

One quirk of floating-point numbers is that they have a separate sign bit, so both positive and negative zero exist, which is not possible in the two's complement signed integer representation.
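A short sketch of the signed-zero quirk, again assuming Python's float is IEEE 754 binary64. The two zeros compare equal, but the sign bit is still stored and can be observed:

```python
import math
import struct

pos_zero = 0.0
neg_zero = -0.0

# The two zeros compare equal...
print(pos_zero == neg_zero)               # True

# ...but copysign exposes the stored sign bit.
print(math.copysign(1.0, pos_zero))       # 1.0
print(math.copysign(1.0, neg_zero))       # -1.0

# The big-endian byte patterns differ only in the top (sign) bit.
print(struct.pack(">d", pos_zero).hex())  # 0000000000000000
print(struct.pack(">d", neg_zero).hex())  # 8000000000000000
```

So -0 is represented exactly, just like the other values in the question; it is simply a distinct bit pattern from +0.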
