
How are results rounded in floating-point arithmetic?

I wrote this code that simply sums a list of n numbers, to practice with floating point arithmetic, and I don't understand this:

I am working with float, which means I have 7 digits of precision. Therefore, if I do the operation 10002*10002=100040004, the result in data type float will be 100040000.000000, since I lose any digit beyond the 7th (the program still knows the exponent).

If the input in this program is

3
10000
10001
10002

However, you will see that when this program computes 30003*30003 (exactly 900180009), it prints 30003*30003=900180032.000000

I understand this 32 appears because I am working with float, and my goal is not to make the program more precise but to understand why this is happening. Why is it 900180032.000000 and not 900180000.000000? Why does this decimal noise (32) appear in 30003*30003 and not in 10002*10002, even though the magnitudes of the numbers are the same? Thank you for your time.

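#include <stdio.h>
#include <math.h>
#define MAX_SIZE 200

int main(void)
{
    int numbers[MAX_SIZE];
    int i, N;
    float sum = 0;   /* running sum of the inputs */
    float sumb = 0;  /* running sum of the squares of the inputs */
    float sumc = 0;  /* square of the final sum */

    printf("introduce n");
    scanf("%d", &N);

    printf("write %d numbers:\n", N);
    for (i = 0; i < N; i++)
    {
        scanf("%d", &numbers[i]);
    }

    int r = 0;

    while (r < N) {
        sum = sum + numbers[r];
        sumb = sumb + (numbers[r] * numbers[r]);  /* int multiply, then float add */
        printf("sum is %f\n", sum);
        printf("sumb is %f\n", sumb);
        r++;
    }
    sumc = (sum * sum);  /* float multiply: the product is rounded to float */
    printf("sumc is %f\n", sumc);
    return 0;
}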

As explained below, the computed result of multiplying 10,002 by 10,002 must be a multiple of eight, and the computed result of multiplying 30,003 by 30,003 must be a multiple of 64, due to the magnitudes of the numbers and the number of bits available for representing them. Although your question asks about “decimal noise,” there are no decimal digits involved here. The results are entirely due to rounding to multiples of powers of two. (Your C implementation appears to use the common IEEE 754 format for binary floating-point.)

When you multiply 10,002 by 10,002, the computed result must be a multiple of eight. I will explain why below. The mathematical result is 100,040,004. The nearest multiples of eight are 100,040,000 and 100,040,008. They are equally far from the exact result, and the rule used to break ties chooses the even multiple (100,040,000 is eight times 12,505,000, an even number, while 100,040,008 is eight times 12,505,001, an odd number).

Many C implementations use IEEE 754 32-bit basic binary floating-point for float. In this format, a number is represented as an integer M multiplied by a power of two, 2^e. The integer M must be less than 2^24 in magnitude. The exponent e may be from −149 to 104. These limits come from the numbers of bits used to represent the integer and the exponent.

So all float values in this format have the value M • 2^e for some M and some e. There are no decimal digits in the format, just an integer multiplied by a power of two.

Consider the number 100,040,004. The biggest M we can use is 16,777,215 (2^24−1). That is not big enough for us to write 100,040,004 as M • 2^0. So we must increase the exponent. Even with 2^2, the biggest we can get is 16,777,215 • 2^2 = 67,108,860. So we must use 2^3. And that is why the computed result must be a multiple of eight, in this case.

So, to produce a result for 10,002•10,002 in float, the computer uses 12,505,000 • 2^3, which is 100,040,000.

In 30,003•30,003, the result must be a multiple of 64. The exact result is 900,180,009. 2^5 is not enough because 16,777,215•2^5 is 536,870,880. So we need 2^6, which is 64. The two nearest multiples of 64 are 900,179,968 and 900,180,032. In this case, the latter is closer (23 away versus 41 away), so it is chosen.

(While I have described the format as an integer times a power of two, it can also be described as a binary numeral with one binary digit before the radix point and 23 binary digits after it, with the exponent range adjusted to compensate. These are mathematically equivalent. The IEEE 754 standard uses the latter description. Textbooks may use the former description because it makes analyzing some of the numerical properties easier.)

Floating point arithmetic is done in binary, not in decimal.

Floats actually have 24 binary bits of precision in the significand: 23 of them are stored explicitly and the leading bit is implicit. The sign is a separate bit. This converts to approximately 7 decimal digits of precision.

The number you're looking at, 900180032, is already 9 digits long, so it makes sense that the last two digits (the 32) might be wrong. The rounding, like the arithmetic, is done in binary, so the reason for the difference in rounding can only be seen if you break things down into binary.

900180032 = 110101101001111010100001 000000

900180000 = 1101011010011110101000001 00000

If you count from the first 1 to the last 1 in each of those numbers (the part before the space), that is how many significand bits it takes to store the number. 900180032 takes only 24 significand bits to store, while 900180000 takes 25, which makes 900180000 impossible to store because floats only have 24 significand bits. 900180032 is the closest number to the correct answer, 900180009, that a float can store.

In the other example

100040000 = 101111101100111110101 000000

100040004 = 1011111011001111101010001 00

The correct answer, 100040004, has 25 significand bits, too many for a float. The nearest number that has 24 or fewer significand bits is 100040000, which needs only 21. (100040008 is equally near, but the tie goes to the value whose significand is even.)

For more on how floating point arithmetic works, try here: http://steve.hollasch.net/cgindex/coding/ieeefloat.html
