Lorenz example with odeint and VexCL yields different results on different devices

Question

Update:

I've run this example on other systems. On an Intel i7-3630QM, Intel HD4000 and Radeon HD 7630M, all results are the same. With an i7-4700MQ / 4800MQ, the CPU results from OpenCL or from a 64-bit gcc build differ from those of a 32-bit gcc build. This is because the 64-bit gcc and OpenCL use SSE by default while the 32-bit gcc uses 387 math; at least, the 64-bit gcc produces the same results as the 32-bit build when -mfpmath=387 is set. So I have to read a lot more and experiment with x86 floating point. Thank you all for your answers.

I've run the Lorenz system example from "Programming CUDA and OpenCL: A case study using modern C++ libraries" for an ensemble of ten systems on each of several OpenCL devices, and I am getting different results:

Quadro K1100M (NVIDIA CUDA)
R => x y z
0.100000 => -0.000000 -0.000000 0.000000
5.644444 => -3.519254 -3.519250 4.644452
11.188890 => 5.212534 5.212530 10.188904
16.733334 => 6.477303 6.477297 15.733333

22.277779 => 3.178553 2.579687 17.946903
27.822224 => 5.008720 7.753564 16.377680
33.366669 => -13.381100 -15.252210 36.107887
38.911114 => 4.256534 6.813675 23.838787
44.455555 => -11.083726 0.691549 53.632290
50.000000 => -8.624105 -15.728293 32.516193

Intel(R) HD Graphics 4600 (Intel(R) OpenCL)
R => x y z
0.100000 => -0.000000 -0.000000 0.000000
5.644444 => -3.519253 -3.519250 4.644451
11.188890 => 5.212531 5.212538 10.188890
16.733334 => 6.477320 6.477326 15.733339

22.277779 => 7.246771 7.398651 20.735369
27.822224 => -6.295782 -10.615027 14.646572
33.366669 => -4.132523 -7.773201 14.292910
38.911114 => 14.183139 19.582197 37.943520
44.455555 => -3.129006 7.564254 45.736408
50.000000 => -9.146419 -17.006729 32.976696

Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz (Intel(R) OpenCL)
R => x y z
0.100000 => -0.000000 -0.000000 0.000000
5.644444 => -3.519254 -3.519251 4.644453
11.188890 => 5.212513 5.212507 10.188900
16.733334 => 6.477303 6.477296 15.733332

22.277779 => -8.295195 -8.198518 22.271002
27.822224 => -4.329878 -4.022876 22.573458
33.366669 => 9.702943 3.997370 38.659538
38.911114 => 16.105495 14.401397 48.537579
44.455555 => -12.551083 -9.239071 49.378693
50.000000 => 7.377638 3.447747 47.542763

As you can see, the three devices agree on the values up to R=16.733334 and then start to diverge.

I have run the same range of R with odeint but without VexCL and get results close to those of the OpenCL-on-CPU run:

Vanilla odeint:

R => x y z
16.733334 => 6.47731 6.47731 15.7333
22.277779 => -8.55303 -6.72512 24.7049
27.822224 => 3.88874 3.72254 21.8227

The example code can be found here: https://github.com/ddemidov/gpgpu_with_modern_cpp/blob/master/src/lorenz_ensemble/vexcl_lorenz_ensemble.cpp

I'm not sure what I am seeing here. Since the CPU results are so close to each other, it looks like an issue with the GPUs, but since I am an OpenCL newbie, I need some pointers on how to find the underlying cause.

Answer 1

You have to understand that GPUs can have lower floating-point accuracy than CPUs. This is common, since GPUs were originally designed for gaming, where exact values are not the design target.

Usually, GPU arithmetic is carried out in 32 bits, while CPUs may do the math internally at higher precision (the x87 FPU, for example, uses 80-bit registers with a 64-bit significand), even if the result is then rounded to 32 bits for storage.

The computation you are running depends heavily on these small differences, so each device produces different results. For example, this operation also gives very different results depending on accuracy:

a=1/(b-c); 
a=1/(b-c); //b = 1.00001, c = 1.00002  -> a = -100000
a=1/(b-c); //b = 1.0000098, c = 1.000021  -> a = -89285.71428

In your own results, you can see the differences between devices, even for low R values:

5.644444 => -3.519254 -3.519250 4.644452
5.644444 => -3.519253 -3.519250 4.644451
5.644444 => -3.519254 -3.519251 4.644453

However, you state "for low values the results agree up to R=16, then start to diverge". Well, that depends, because they are not exactly equal, even for R=5.64.

Answer 2

I've created a stackoverflow-23805423 branch to test this. Below is the output for different devices. Note that both CPUs and the AMD GPU have consistent results. The Nvidia GPUs also agree with each other, only their results differ from the rest. This question seems to be related: IEEE-754 standard on NVIDIA GPU (sm_13)

```
1. Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (Intel(R) OpenCL)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  9.392907e-01  1.679711e+00  1.455276e+01) (  5.351486e+00  1.051580e+01  9.403333e+00)
     6: ( -1.287673e+01 -2.096754e+01  2.790419e+01) ( -6.555650e-01 -2.142401e+00  2.721632e+01)
     8: (  2.711249e+00  2.540842e+00  3.259012e+01) ( -4.936437e+00  8.534876e-02  4.604861e+01)
}

1. Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz (AMD Accelerated Parallel Processing)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  9.392907e-01  1.679711e+00  1.455276e+01) (  5.351486e+00  1.051580e+01  9.403333e+00)
     6: ( -1.287673e+01 -2.096754e+01  2.790419e+01) ( -6.555650e-01 -2.142401e+00  2.721632e+01)
     8: (  2.711249e+00  2.540842e+00  3.259012e+01) ( -4.936437e+00  8.534876e-02  4.604861e+01)
}

1. Capeverde (AMD Accelerated Parallel Processing)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  9.392907e-01  1.679711e+00  1.455276e+01) (  5.351486e+00  1.051580e+01  9.403333e+00)
     6: ( -1.287673e+01 -2.096754e+01  2.790419e+01) ( -6.555650e-01 -2.142401e+00  2.721632e+01)
     8: (  2.711249e+00  2.540842e+00  3.259012e+01) ( -4.936437e+00  8.534876e-02  4.604861e+01)
}

1. Tesla C1060 (NVIDIA CUDA)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  7.636878e+00  2.252859e+00  2.964935e+01) (  1.373357e+01  8.995382e+00  3.998563e+01)
     6: (  7.163476e+00  8.802735e+00  2.839662e+01) ( -5.536365e+00 -5.997181e+00  3.191463e+01)
     8: ( -2.762679e+00 -5.167883e+00  2.324565e+01) (  2.776211e+00  4.734162e+00  2.949507e+01)
}

1. Tesla K20c (NVIDIA CUDA)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  7.636878e+00  2.252859e+00  2.964935e+01) (  1.373357e+01  8.995382e+00  3.998563e+01)
     6: (  7.163476e+00  8.802735e+00  2.839662e+01) ( -5.536365e+00 -5.997181e+00  3.191463e+01)
     8: ( -2.762679e+00 -5.167883e+00  2.324565e+01) (  2.776211e+00  4.734162e+00  2.949507e+01)
}

1. Tesla K40c (NVIDIA CUDA)

R = {
     0:  5.000000e+00  1.000000e+01  1.500000e+01  2.000000e+01  2.500000e+01
     5:  3.000000e+01  3.500000e+01  4.000000e+01  4.500000e+01  5.000000e+01
}

X = {
     0: ( -3.265986e+00 -3.265986e+00  4.000000e+00) (  4.898979e+00  4.898979e+00  9.000000e+00)
     2: (  6.110101e+00  6.110101e+00  1.400000e+01) ( -7.118047e+00 -7.118044e+00  1.900000e+01)
     4: (  7.636878e+00  2.252859e+00  2.964935e+01) (  1.373357e+01  8.995382e+00  3.998563e+01)
     6: (  7.163476e+00  8.802735e+00  2.839662e+01) ( -5.536365e+00 -5.997181e+00  3.191463e+01)
     8: ( -2.762679e+00 -5.167883e+00  2.324565e+01) (  2.776211e+00  4.734162e+00  2.949507e+01)
}
```
