
OpenCL Floating point precision

I found a problem with the host vs. device floating-point behavior in OpenCL. The problem is that the floating-point values calculated by OpenCL are not within the same limits as those of my Visual Studio 2010 compiler when compiling for x86. When compiling for x64, however, they are within the same limits. I know it has to be related to this: http://www.viva64.com/en/b/0074/

The source I used during testing was: http://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism When I ran the program as an x86 build, only 202 numbers were equal out of the 1269760 numbers squared by both the kernel and the C++ program. In the x64 build, however, all 1269760 numbers were right, in other words 100%. Furthermore, I found that the difference between the OpenCL result and the x86 C++ result was 5.5385384e-014, which is a very small fraction but not small enough compared to the epsilon of the number, which was 2.92212543378266922312416e-19.
That is because the difference needs to be smaller than the epsilon for the program to recognize the two numbers as equal. Of course one would normally never compare floats directly with ==, but it is good to know that the float limits are different. And yes, I tried to set flt:static, but got the same error.

So I would like some sort of explanation for this behavior. Thanks in advance for all answers.

Since nothing changes in the GPU code as you switch your project from x86 to x64, it all has to do with how the multiplication is performed on the CPU. There are some subtle differences between floating-point handling in x86 and x64 modes, and the biggest one is that since any x64 CPU also supports SSE and SSE2, SSE is used by default for math operations in 64-bit mode on Windows.

The HD4770 GPU does all computations using single-precision floating-point units. Modern x64 CPUs, on the other hand, have two kinds of functional units that handle floating-point numbers:

  • the x87 FPU, which operates with the much higher extended precision of 80 bits
  • the SSE FPU, which operates with 32-bit and 64-bit precision and is much more compatible with how other CPUs handle floating-point numbers

In 32-bit mode the compiler does not assume that SSE is available and generates the usual x87 FPU code to do the math. In this case operations like data[i] * data[i] are performed internally using the much higher 80-bit precision. A comparison of the kind if (results[i] == data[i] * data[i]) is performed as follows:

  • data[i] is pushed onto the x87 FPU stack using FLD DWORD PTR data[i]
  • data[i] * data[i] is computed using FMUL DWORD PTR data[i]
  • result[i] is pushed onto the x87 FPU stack using FLD DWORD PTR result[i]
  • both values are compared using FUCOMPP

Here comes the problem. data[i] * data[i] resides in an x87 FPU stack element in 80-bit precision, while result[i] comes from the GPU in 32-bit precision. The two numbers will most likely differ, since data[i] * data[i] has many more significant digits, whereas result[i] (viewed in 80-bit precision) has lots of trailing zeros!

In 64-bit mode things happen differently. The compiler knows that your CPU is SSE capable and uses SSE instructions to do the math. The same comparison statement is performed in the following way on x64:

  • data[i] is loaded into an SSE register using MOVSS XMM0, DWORD PTR data[i]
  • data[i] * data[i] is computed using MULSS XMM0, DWORD PTR data[i]
  • result[i] is loaded into another SSE register using MOVSS XMM1, DWORD PTR result[i]
  • both values are compared using UCOMISS XMM1, XMM0

In this case the square operation is performed with the same 32-bit single precision as is used on the GPU, and no intermediate results with 80-bit precision are generated. That is why the results are the same.

It is very easy to actually test this, even without a GPU being involved. Just run the following simple program:

#include <stdlib.h>
#include <stdio.h>

float mysqr(float f)
{
    f *= f;
    return f;
}

int main (void)
{
    int i, n;
    float f, f2;

    srand(1);
    for (i = n = 0; n < 1000000; n++)
    {
        f = rand()/(float)RAND_MAX;
        if (mysqr(f) != f*f) i++;
    }
    printf("%d of %d squares differ\n", i, n);
    return 0;
}

mysqr is specifically written so that the intermediate 80-bit result gets converted to a 32-bit precision float (the assignment to the float parameter forces the rounding). If you compile and run in 64-bit mode, the output is:

0 of 1000000 squares differ

If you compile and run in 32-bit mode, the output is:

999845 of 1000000 squares differ

In principle you should be able to change the floating-point model in 32-bit mode (Project Properties -> Configuration Properties -> C/C++ -> Code Generation -> Floating Point Model), but doing so changes nothing, since at least on VS2010 intermediate results are still kept in the FPU. What you can do is enforce a store and reload of the computed square so that it is rounded to 32-bit precision before it is compared with the result from the GPU. In the simple example above this is achieved by changing:

if (mysqr(f) != f*f) i++;

to

if (mysqr(f) != (float)(f*f)) i++;

After the change, the 32-bit code output becomes:

0 of 1000000 squares differ

In my case

(float)(f*f)

didn't help. I used

  correct = 0;
  for(unsigned int i = 0; i < count; i++) {
    volatile float sqr = data[i] * data[i];
    if(results[i] == sqr)
      correct++;
  }

instead.
