non-linear optimization on the GPU (CUDA) without data transfer latency

I am trying to solve a non-linear optimization problem entirely on the GPU. Computation of the objective function and data transfer between the GPU and the CPU are the bottlenecks. To address this, I want to

  1. heavily parallelize computation of the objective and
  2. perform the entire optimization on the GPU.

More specifically, the problem is as follows in pseudo-code:

x = x0  // initial guess of the vector of unknowns, typically of size ~10,000
for iteration = 1 : max_iter
      D = compute_search_direction(x)
      alpha = compute_step_along_direction(x)
      x = x + D * alpha  // update
end for

The functions compute_search_direction(x) and compute_step_along_direction(x) both call the objective function f0(x) dozens of times per iteration. The objective function is a complicated CUDA kernel: basically a forward Bloch simulation (the set of equations that describes the dynamics of nuclear spins in a magnetic field). The outputs of f0(x) are F (the value of the objective function, a scalar) and DF (the Jacobian, i.e. the vector of first derivatives, with the same size as x, ~10,000). On the GPU, f0(x) is really fast, but transferring x from the CPU to the GPU and then transferring F and DF back from the GPU to the CPU takes a while (~1 second total). Because the function is called dozens of times per iteration, this leads to a pretty slow overall optimization.
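For concreteness, the device-side interface of f0 might look like the sketch below (the Bloch simulation body and the reduction that produces F are elided, and all names are illustrative); the point is that x, F and DF all live in device memory, so evaluating f0 by itself involves no host-device transfer.

// Illustrative interface only -- the Bloch simulation itself is omitted.
__global__ void f0_kernel(const float* x,  // unknowns, n ~ 10,000
                          int n,
                          float* F,        // scalar objective, in device memory
                          float* DF)       // Jacobian, same size as x
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... forward Bloch simulation contributing to component i ...
        // DF[i] = derivative of the objective with respect to x[i];
        // F is accumulated separately, e.g. with atomicAdd or a
        // reduction kernel.
    }
}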

Ideally, I would want to have the entire pseudo-code above run on the GPU. The only solution I can think of right now is nested kernel launches. The pseudo-code above would be the "outer kernel", launched with 1 thread and 1 block (i.e., this kernel is not really parallel...). This kernel would then call the objective function (i.e., the "inner kernel", which is massively parallel) every time it needs to evaluate the objective function and the vector of first derivatives. Since kernel launches are asynchronous, I can force the GPU to wait until the f0 inner kernel is fully evaluated before moving to the next instruction of the outer kernel (using a synchronization point).
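For reference, CUDA supports this pattern via dynamic parallelism (compute capability 3.5+, compiled with nvcc -rdc=true). A minimal sketch of the outer kernel, with all kernel names and signatures assumed, might look like this:

// Sketch of the "outer kernel" idea using CUDA dynamic parallelism.
// Compile with: nvcc -rdc=true (compute capability 3.5+).
// All kernel names and signatures below are placeholders.
__global__ void compute_search_direction(const float* x, float* D, int n) { /* ... */ }
__global__ void compute_step_along_direction(const float* x, const float* D,
                                             float* alpha, int n) { /* ... */ }

__global__ void update_x(float* x, const float* D, const float* alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += (*alpha) * D[i];  // x = x + D * alpha
}

// Launched as outer_kernel<<<1, 1>>>(...): a single thread that only
// sequences work and does no parallel computation itself.
__global__ void outer_kernel(float* x, float* D, float* alpha, int n, int max_iter)
{
    int block = 256;
    int grid  = (n + block - 1) / block;
    for (int it = 0; it < max_iter; ++it) {
        // Child kernels launched by one thread into the same stream
        // execute in launch order, so each child sees the output of
        // the previous one without an explicit synchronization point.
        compute_search_direction<<<grid, block>>>(x, D, n);
        compute_step_along_direction<<<grid, block>>>(x, D, alpha, n);
        update_x<<<grid, block>>>(x, D, alpha, n);
    }
}

One caveat: device-side cudaDeviceSynchronize() is deprecated (and removed in CUDA 12), so the sketch relies on stream ordering alone. If the outer loop had to branch on a value produced by a child kernel (e.g. a line-search test on F), sequencing that on the GPU becomes considerably harder, which is part of what makes this approach cumbersome.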

In a sense, this is really the same as regular CUDA programming, where the CPU controls kernel launches to evaluate the objective function f0, except that the CPU is replaced by an outer kernel that is not parallelized (1 thread, 1 block). However, since everything stays on the GPU, there is no data transfer latency anymore.

I am now trying the idea on a simple example to check feasibility. However, this seems quite cumbersome... My questions are:

  1. Does this make any sense to anyone else?
  2. Is there a more direct way to achieve the same result without the added complexity of nested kernels?

It seems you are mixing up "reducing memory transfers between the GPU and the CPU" and "having the entire code run on the device (i.e., on the GPU)".

In order to reduce memory transfers, you do not need to have the entire code run on GPU.

You can copy your data to the GPU once, and then switch back and forth between GPU code and CPU code. As long as you don't try to access any GPU memory from your CPU code (and vice-versa), you should be fine.

Here's pseudo-code for a correct approach to what you want to do.

// CPU code
cudaMalloc(&x, ...);  // allocate memory for x on the GPU
cudaMemcpy(x, x0, size, cudaMemcpyHostToDevice);  // copy x0 into the freshly allocated array
cudaMalloc(&D, ...);  // allocate D and alpha before the loop
cudaMalloc(&alpha, ...);
for iteration = 1 : max_iter
      compute_search_direction<<<...>>>(x, D)  // kernel computes the search direction and stores it in D
      compute_step_along_direction<<<...>>>(x, D, alpha)
      combine_result<<<...>>>(x, D, alpha)  // x = x + D * alpha
end for
// Eventually copy x back to the CPU, if need be
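To make that pseudo-code concrete, here is a minimal sketch of the same loop in CUDA C++ (the kernel names come from the pseudo-code above, but their exact signatures and the use of float are assumptions; error checking is omitted):

#include <cuda_runtime.h>

// Kernels assumed to be defined elsewhere, with the names used above:
__global__ void compute_search_direction(const float* x, float* D, int n);
__global__ void compute_step_along_direction(const float* x, const float* D,
                                             float* alpha, int n);
__global__ void combine_result(float* x, const float* D, const float* alpha, int n);

// Host-side driver: data is copied to the GPU once, every iteration runs
// through kernel launches only, and just the final x comes back to the CPU.
void optimize(const float* x0_host, float* x_host, int n, int max_iter)
{
    float *x, *D, *alpha;
    cudaMalloc(&x,     n * sizeof(float));
    cudaMalloc(&D,     n * sizeof(float));
    cudaMalloc(&alpha, sizeof(float));
    cudaMemcpy(x, x0_host, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;
    int grid  = (n + block - 1) / block;
    for (int it = 0; it < max_iter; ++it) {
        // Launch overhead is on the order of microseconds, and since all
        // arguments are device pointers, nothing crosses the PCIe bus here.
        compute_search_direction<<<grid, block>>>(x, D, n);
        compute_step_along_direction<<<grid, block>>>(x, D, alpha, n);
        combine_result<<<grid, block>>>(x, D, alpha, n);  // x = x + D * alpha
    }

    // The synchronous cudaMemcpy runs on the default stream, so it waits
    // for all previously launched kernels before copying the result back.
    cudaMemcpy(x_host, x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x); cudaFree(D); cudaFree(alpha);
}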

Hope it helps!
