
Dot product of two single-precision floating point vectors yields different results in CUDA kernel than on the host

While debugging some CUDA code, I was comparing it to equivalent CPU code using printf statements, and noticed that in some cases my results differed. They weren't necessarily wrong on either platform, since they agreed to within floating-point rounding error, but I am still interested in knowing what gives rise to this difference.

I was able to track the problem down to differing dot product results. In both the CUDA and host code I have vectors a and b of type float4. Then, on each platform, I compute the dot product and print the result, using this code:

printf("a: %.24f\t%.24f\t%.24f\t%.24f\n",a.x,a.y,a.z,a.w);
printf("b: %.24f\t%.24f\t%.24f\t%.24f\n",b.x,b.y,b.z,b.w);
float dot_product = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
printf("a dot b: %.24f\n",dot_product);

and the resulting printout for the CPU is:

a: 0.999629139900207519531250   -0.024383276700973510742188 -0.012127066962420940399170 0.013238593004643917083740
b: -0.001840781536884605884552  0.033134069293737411499023  0.988499701023101806640625  1.000000000000000000000000
a dot b: -0.001397025771439075469971

and for the CUDA kernel:

a: 0.999629139900207519531250   -0.024383276700973510742188 -0.012127066962420940399170 0.013238593004643917083740
b: -0.001840781536884605884552  0.033134069293737411499023  0.988499701023101806640625  1.000000000000000000000000
a dot b: -0.001397024840116500854492

As you can see, the values of a and b appear to be bitwise identical on both platforms, yet the result of the exact same code differs ever so slightly. It is my understanding that floating-point multiplication and addition are well-defined by the IEEE 754 standard and are hardware-independent. However, I do have two hypotheses as to why I am not seeing the same results:

  1. Compiler optimizations are re-ordering the operations, so the terms of the sum are accumulated in a different order on the GPU and the CPU; since floating-point addition is not associative, this could give rise to different results.
  2. The CUDA kernel is using the fused multiply-add (FMA) operation, as described in http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf . In this case, the CUDA results should actually be a bit more accurate (a host-side way to test this is sketched after this list).
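
If hypothesis 2 is correct, it can be tested on the host without a GPU: compute the same dot product with fmaf() from <math.h>, which mirrors one plausible way the compiler could contract the expression. This is only a sketch under that assumption, not the compiler's guaranteed output; if the hypothesis holds, the fmaf version should reproduce the kernel's value (link with -lm on Linux):

#include <math.h>
#include <stdio.h>

/* Plain left-to-right evaluation: every product and every sum is rounded
   to single precision. Compile with contraction disabled (for example,
   gcc -ffp-contract=off dot.c -lm) so the host compiler does not turn
   this version into FMAs as well. */
static float dot_plain(const float a[4], const float b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* One plausible contraction: each product after the first is folded into
   the running sum with a single rounding via fmaf(). */
static float dot_fma(const float a[4], const float b[4])
{
    float t = a[0] * b[0];
    t = fmaf(a[1], b[1], t);
    t = fmaf(a[2], b[2], t);
    t = fmaf(a[3], b[3], t);
    return t;
}

int main(void)
{
    /* The exact values printed in the question. */
    const float a[4] = {  0.999629139900207519531250f, -0.024383276700973510742188f,
                         -0.012127066962420940399170f,  0.013238593004643917083740f };
    const float b[4] = { -0.001840781536884605884552f,  0.033134069293737411499023f,
                          0.988499701023101806640625f,  1.000000000000000000000000f };
    printf("plain: %.24f\n", dot_plain(a, b));
    printf("fma:   %.24f\n", dot_fma(a, b));
    return 0;
}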

Except for merging FMUL and FADD into FMA (which can be turned off with the nvcc command-line switch -fmad=false), the CUDA compiler observes the evaluation order prescribed by C/C++. Depending on how your CPU code is compiled, it may use a precision wider than single precision to accumulate the dot product, which then yields a different result.
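
For instance, with x87 math or a compiler that promotes intermediates, the host may effectively evaluate something like the following. This is an illustrative sketch of what "wider precision" means here, not necessarily what your compiler emits:

/* Intermediates kept in double precision; rounding to float happens only
   once, at the final conversion, instead of after every multiply and add. */
float dot_wide = (float)((double)a.x * b.x
                       + (double)a.y * b.y
                       + (double)a.z * b.z
                       + (double)a.w * b.w);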

For GPU code, merging of FMUL/FADD into FMA is a common occurrence, as are the resulting numerical differences. The CUDA compiler performs aggressive FMA merging for performance reasons. Use of FMA usually also yields more accurate results, since the number of rounding steps is reduced, and there is some protection against subtractive cancellation because FMA maintains the full-width product internally. I would suggest reading the following whitepaper, as well as the references it cites:

https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
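
The cancellation protection is easiest to see in the classic difference-of-products computation. The following is a standard Kahan-style sketch (not taken from the whitepaper itself) that uses fmaf() to recover the rounding error a plain subtraction would lose:

#include <math.h>

/* Accurately computes a*b - c*d, exploiting the full-width product
   that FMA keeps internally. */
float diff_of_products(float a, float b, float c, float d)
{
    float cd  = c * d;
    float err = fmaf(c, d, -cd);  /* exact rounding error of c*d */
    float dop = fmaf(a, b, -cd);  /* a*b - cd with a single rounding */
    return dop - err;             /* a*b - c*d, nearly correctly rounded */
}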

To get the CPU and GPU results to match for a sanity check, turn off FMA merging in the GPU code with -fmad=false, and on the CPU enforce that each intermediate result is stored in single precision:

   /* volatile forces each intermediate to be stored to (and rounded as)
      a single-precision float, preventing wider-precision accumulation
      and FMA contraction on the host. */
   volatile float p0, p1, p2, p3, dot_product;
   p0 = a.x * b.x;
   p1 = a.y * b.y;
   p2 = a.z * b.z;
   p3 = a.w * b.w;
   dot_product  = p0;
   dot_product += p1;
   dot_product += p2;
   dot_product += p3;
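
With the volatile stores above, every intermediate is rounded to single precision before it is reused, so the host can neither accumulate in wider precision nor fuse a multiply into an add. When comparing the unmodified expression instead, you may also want to disable contraction on the host explicitly, e.g. with -ffp-contract=off for gcc/clang.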
