
CUDA math vs C++ math

I implemented the same algorithm on the CPU using C++ and on the GPU using CUDA. In this algorithm I have to solve an integral numerically, since it has no analytic solution. The function I have to integrate is a weird polynomial fit of a curve with an exp factor at the end.

In C++

for(int l = 0; l < 200; l++)
{
    integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*exp(-a6*r_int)*step;
    r_int = r_int + step;
}

In CUDA

for(int l = 0; l < 200; l++)
{
    integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*__expf(-a6*r_int)*step;
    r_int = r_int + step;
}

Output:

CPU: dose_output=0.00165546

GPU: dose_output=0.00142779

I think that the exp function of math.h and the __expf function of CUDA are not calculating the same thing. I tried removing the --use_fast_math compiler flag, thinking it was the cause, but the two implementations still diverge by around 20%.

I'm using CUDA to accelerate medical physics algorithms, and this kind of difference is not acceptable, since I have to prove that one of the outputs is "more true" than the other, and it could obviously be catastrophic for patients.

Does the difference come from the function itself? Otherwise, I'm thinking it might come from the memcpy of the a_i factors or from the way I fetch them.
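
A minimal sketch of how the memcpy of the a_i factors could be ruled out (the round-trip test below is illustrative only, not part of the original code): copy the coefficients to the device, copy them back, and compare bit for bit.

// Illustrative check: round-trip the coefficients through the device and
// compare them with the host values. Any mismatch here would point to the
// memcpy/fetch path rather than to the math functions.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 7;
    float h_a[n] = {5.9991e-04f, -1.4694e-02f, 1.1588f, 4.5675e-01f,
                    -3.8617e-03f, 3.2066e-03f, 4.7050e-01f};
    float h_back[n];

    float *d_a = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_back, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("a%d: host=%.8e device=%.8e %s\n", i, h_a[i], h_back[i],
               h_a[i] == h_back[i] ? "OK" : "MISMATCH");

    cudaFree(d_a);
    return 0;
}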

Edit: "Complete" code

#include <cmath>
#include <iostream>

using namespace std;

int main()
{
    // Fit coefficients (values from Carleton's seed database)
    float a0 = 5.9991e-04;
    float a1 = -1.4694e-02;
    float a2 = 1.1588;
    float a3 = 4.5675e-01;
    float a4 = -3.8617e-03;
    float a5 = 3.2066e-03;
    float a6 = 4.7050e-01;

    float integral = 0.0;

    float r_int = 5.0;       // integration starts at r = 5
    float step = 0.1/200;    // 200 steps over an interval of 0.1

    // Rectangle-rule sum of the polynomial-times-exponential integrand
    for(int l = 0; l < 200; l++)
    {
        integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*exp(-a6*r_int)*step;
        r_int = r_int + step;
    }

    cout << "Integral=" << integral << endl;
    return 0;
}

I would suggest running this part on both a GPU and a CPU. The values are from Carleton's seed database.
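
A minimal, self-contained sketch of that side-by-side run (single-thread kernel; the kernel and function names are illustrative, not from the original code). It prints a double-precision host reference next to the device sums computed with expf() and with the __expf() intrinsic.

#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Single-thread kernel, only to reproduce the numbers: out[0] uses expf(),
// out[1] uses the __expf() intrinsic.
__global__ void integrate_gpu(float *out)
{
    const float a0 = 5.9991e-04f, a1 = -1.4694e-02f, a2 = 1.1588f,
                a3 = 4.5675e-01f, a4 = -3.8617e-03f, a5 = 3.2066e-03f,
                a6 = 4.7050e-01f;
    float s0 = 0.0f, s1 = 0.0f, r = 5.0f;
    const float step = 0.1f / 200;
    for (int l = 0; l < 200; ++l) {
        float poly = a0/(r*r) + a1/r + a2 + a3*r + a4*r*r + a5*r*r*r;
        s0 += poly * expf(-a6 * r) * step;
        s1 += poly * __expf(-a6 * r) * step;
        r += step;
    }
    out[0] = s0;
    out[1] = s1;
}

// Host reference of the same sum in double precision.
double integrate_cpu()
{
    const double a0 = 5.9991e-04, a1 = -1.4694e-02, a2 = 1.1588,
                 a3 = 4.5675e-01, a4 = -3.8617e-03, a5 = 3.2066e-03,
                 a6 = 4.7050e-01;
    double s = 0.0, r = 5.0;
    const double step = 0.1 / 200;
    for (int l = 0; l < 200; ++l) {
        double poly = a0/(r*r) + a1/r + a2 + a3*r + a4*r*r + a5*r*r*r;
        s += poly * exp(-a6 * r) * step;
        r += step;
    }
    return s;
}

int main()
{
    float h_out[2], *d_out;
    cudaMalloc(&d_out, 2 * sizeof(float));
    integrate_gpu<<<1, 1>>>(d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("CPU double exp(): %.8f\n", integrate_cpu());
    printf("GPU expf():       %.8f\n", h_out[0]);
    printf("GPU __expf():     %.8f\n", h_out[1]);
    return 0;
}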

You are using the least accurate implementation of exp() available in the CUDA API.

Basically you can use three versions of exp() on the device (see the short sketch after this list):

  • exp(), the most accurate one (double precision)
  • expf(), which is a single-precision "equivalent"
  • __expf(), which is an intrinsic version of the previous one, and the least accurate
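
As a short, illustrative device-side sketch (the function names are made up), the three spellings look like this:

// Three ways to evaluate the exponential term in device code.
__device__ float  term_intrinsic(float a6, float r)  { return __expf(-a6 * r); } // fastest, least accurate
__device__ float  term_single(float a6, float r)     { return expf(-a6 * r);   } // single precision, tighter error bound
__device__ double term_double(double a6, double r)   { return exp(-a6 * r);    } // double precision, most accurate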

You can read more about the different implementations of mathematical functions, including double-precision, single-precision and intrinsic versions, in the Mathematical Functions Appendix of the CUDA documentation:

D.2. Intrinsic Functions

The functions from this section can only be used in device code.

Among these functions are the less accurate, but faster versions of some of the functions of Standard Functions. They have the same name prefixed with __ (such as __sinf(x)). They are faster as they map to fewer native instructions.

On the same page you will read that removing that compiler option just prevents every standard function from being automatically replaced by its intrinsic version. As you explicitly use an intrinsic version of exp(), removing this flag changes nothing for you:

The compiler has an option (-use_fast_math) that forces each function in Table 8 to compile to its intrinsic counterpart.
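
To illustrate the quoted behaviour, a small sketch (the kernel name is made up; the compile lines in the comments are the usual nvcc invocations):

// What the flag does and does not change:
//
//   nvcc kernel.cu                  -> expf() stays the standard single-precision function
//   nvcc -use_fast_math kernel.cu   -> expf() is compiled to the __expf() intrinsic
//
// The explicit __expf() call below is the fast intrinsic in both cases, which is
// why removing the flag did not change the result.
__global__ void demo(const float *x, float *y)
{
    y[0] = expf(x[0]);    // affected by -use_fast_math
    y[1] = __expf(x[0]);  // always the fast intrinsic
}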
