
'float' vs. 'double' precision

The code

float x  = 3.141592653589793238;
double z = 3.141592653589793238;
printf("x=%f\n", x);
printf("z=%f\n", z);
printf("x=%20.18f\n", x);
printf("z=%20.18f\n", z);

will give you the output

x=3.141593
z=3.141593
x=3.141592741012573242
z=3.141592653589793116

where on the third line of output 741012573242 is garbage and on the fourth line 116 is garbage. Do doubles always have 16 significant figures while floats always have 7 significant figures? Why don't doubles have 14 significant figures?

On virtually every current platform, floating point numbers in C use IEEE 754 encoding, although the C standard itself does not require it.

This type of encoding uses a sign, a significand, and an exponent.

Because of this encoding, most decimal values cannot be stored exactly and are rounded to the nearest representable value.

Also, the number of significant decimal digits can vary slightly, since it is a binary representation, not a decimal one.

Single precision (float) gives you 23 explicitly stored bits of significand (24 counting the implicit leading bit), 8 bits of exponent, and 1 sign bit.

Double precision (double) gives you 52 explicitly stored bits of significand (53 counting the implicit leading bit), 11 bits of exponent, and 1 sign bit.

Do doubles always have 16 significant figures while floats always have 7 significant figures?

No. Doubles always have 53 significant bits and floats always have 24 significant bits (except for denormals, infinities, and NaN values, but those are subjects for a different question). These are binary formats, and you can only speak clearly about the precision of their representations in terms of binary digits (bits).

This is analogous to the question of how many digits can be stored in a binary integer: an unsigned 32-bit integer can store integers with up to 32 bits, which doesn't map precisely to any number of decimal digits. All integers of up to 9 decimal digits can be stored, but a lot of 10-digit numbers can be stored as well.

Why don't doubles have 14 significant figures?

The encoding of a double uses 64 bits (1 bit for the sign, 11 bits for the exponent, and 52 explicit significand bits, plus one implicit bit that is not stored), which is double the number of bits used to represent a float (32 bits).

  • float : 23 bits of significand, 8 bits of exponent, and 1 sign bit.
  • double : 52 bits of significand, 11 bits of exponent, and 1 sign bit.

It's usually expressed in base-2 digits of the significand (with a base-2 exponent), not in base 10. From what I can tell from the C99 standard, however, the exact precision of float and double is not specified; only minimums are given (for example, 1 and 1 + 1E-5 must be distinguishable as float, and 1 and 1 + 1E-9 as double). Beyond that, the number of significant figures is left to the implementer, as is the base used internally; in other words, an implementation could decide to use 18 digits of precision in base 3. [1]

If you need to know these values, the constants FLT_RADIX and FLT_MANT_DIG (and DBL_MANT_DIG / LDBL_MANT_DIG) are defined in float.h.

The reason it's called a double is that the number of bytes used to store it is double that of a float (but this includes both the exponent and the significand). The IEEE 754 standard (used by most compilers) allocates relatively more bits to the significand than to the exponent and sign (23 to 9 for float vs. 52 to 12 for double), which is why the precision is more than doubled.

1: Section 5.2.4.2.2 ( http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf )

Floats have 23 stored bits of precision, and doubles have 52 (24 and 53 counting the implicit leading bit).

It's not exactly double the precision because of how IEEE 754 works, and because binary doesn't really translate well to decimal. Take a look at the standard if you're interested.
