
Creating arrays on the GPU with numba in Python using CUDA

I want to evaluate a function at every point in a mesh. The trouble is, if I create the mesh on the CPU side, the act of transferring it to the GPU takes longer than the actual calculations. Can I generate the mesh on the GPU side?

The code below shows creation of the mesh on the CPU side and evaluation of most of the expression on the GPU side (I wasn't sure how to get atan2 to work on the GPU, so I left it on the CPU side). I should apologize in advance and say that I'm still learning this stuff, so I'm sure there's a lot of room for improvement in the code below!

Thanks!

import math
from numba import vectorize, float64
import numpy as np
from time import time

@vectorize([float64(float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2):
    return  (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)

def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    a = a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1), 
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2))
    return earthdiam_nm * np.arctan2(a,1-a)

# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))

start = time()
LLA_distance_numba_cuda(X,Y,X2,Y2)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))

Let's establish a performance baseline. Adding a definition (1.0) for earthdiam_nm and running your code under nvprof, we have:

$ nvprof python t38.py
1000000 total evaluations in 0.581 seconds
(...)
==1973== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   55.58%  11.418ms         4  2.8544ms  2.6974ms  3.3044ms  [CUDA memcpy HtoD]
                   28.59%  5.8727ms         1  5.8727ms  5.8727ms  5.8727ms  cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
                   15.83%  3.2521ms         1  3.2521ms  3.2521ms  3.2521ms  [CUDA memcpy DtoH]
(...)

So on my particular setup, the "kernel" itself runs in ~5.8ms on my (small, slow) Quadro K2000 GPU, and the data copies take a total of 11.4ms for the 4 host-to-device transfers plus 3.2ms for the transfer of results back to the host. Our focus will be on the 4 copies from host to device.

Let's go after the low-hanging fruit first. This line of code:

X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))

isn't really doing anything other than passing the values 30 and 101 to each "worker". I'm using "worker" here to refer to a particular scalar computation in the numba process of "broadcasting" the vectorize function across a large data set. The numba vectorize/broadcast process doesn't require that each and every input be a data set of the same size, merely that the supplied data be "broadcast"-able. So it's possible to create a vectorize ufunc that works on an array and a scalar, for example. In that case each worker uses its own element of the array, plus the (same) scalar, to perform its computation.
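For illustration, here is a minimal sketch (using the default CPU target, with hypothetical names) of a vectorize ufunc that broadcasts a scalar across an array:

import numpy as np
from numba import vectorize, float64

@vectorize([float64(float64, float64)])
def add_offset(x, offset):
    # each worker receives one element of x plus the same scalar offset
    return x + offset

print(add_offset(np.arange(4, dtype=np.float64), 10.0))   # [10. 11. 12. 13.]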

Therefore the low-hanging fruit is to simply remove these two arrays and pass the values (30, 101) as scalars to the ufunc a_cuda. While we are going after "low hanging fruit", let's also fold your arctan2 calculation (replacing np.arctan2 with math.atan2) and your final scaling by earthdiam_nm into the vectorize code, so we don't have to do them on the host in python/numpy:

$ cat t39.py
import math
from numba import vectorize, float64
import numpy as np
from time import time
earthdiam_nm = 1.0
@vectorize([float64(float64,float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s

def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    return a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1),
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2), earthdiam_nm)

# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
# X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
start = time()
Z=LLA_distance_numba_cuda(X,Y,30.0,101.0)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Z)
$ nvprof python t39.py
==2387== NVPROF is profiling process 2387, command: python t39.py
1000000 total evaluations in 0.401 seconds
==2387== Profiling application: python t39.py
==2387== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   48.12%  8.4679ms         1  8.4679ms  8.4679ms  8.4679ms  cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
                   33.97%  5.9774ms         5  1.1955ms     864ns  3.2535ms  [CUDA memcpy HtoD]
                   17.91%  3.1511ms         4  787.77us  1.1840us  3.1459ms  [CUDA memcpy DtoH]
(snip)

Now we see that the HtoD copy operations have been reduced from ~11.4ms total to ~6.0ms total. The kernel has grown from ~5.8ms to ~8.5ms because we are doing more work in it, but the python-reported execution time for the function has dropped from ~0.58s to ~0.4s.

Can we do better?

We can, but in order to do so (I believe) we'll need to use a different numba CUDA method. The vectorize method is convenient for scalar element-wise operations, but it has no way to know where in the overall data set a given operation is being carried out. We need this information, and we can get it in CUDA code, but we'll need to switch to the @cuda.jit decorator to do so.
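As background: the cuda.grid(2) call used below returns each thread's absolute (x, y) position in the overall launch grid. Here is a minimal sketch (a hypothetical demo kernel, not part of the solution) of the manual index computation it is shorthand for:

from numba import cuda

@cuda.jit
def where_am_i(out):
    # cuda.grid(2) is equivalent to computing these two absolute indices:
    x = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    y = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y
    if x < out.shape[1] and y < out.shape[0]:
        out[y, x] = x + 1000 * y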

The following code converts the previous vectorize a_cuda function into a @cuda.jit device function (with essentially no other changes), and then we create a CUDA kernel that does the mesh generation according to the supplied scalar parameters, and computes the result:

$ cat t40.py
import math
from numba import vectorize, float64, cuda
import numpy as np
from time import time

earthdiam_nm = 1.0

@cuda.jit(device=True)
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
             math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s

@cuda.jit
def LLA_distance_numba_cuda(lat2, lon2, xb, xe, yb, ye, s, nx, ny, out):
    x,y = cuda.grid(2)
    if x < nx and y < ny:
        lat1 = (((xe-xb) * x)/(nx-1)) + xb # mesh generation: linspace(xb, xe, nx)[x]
        lon1 = (((ye-yb) * y)/(ny-1)) + yb # mesh generation: linspace(yb, ye, ny)[y]
        out[y, x] = a_cuda(lat1, lon1, lat2, lon2, s)

nx, ny = 1000,1000
Z = cuda.device_array((nx,ny), dtype=np.float64)
threads = (32,32)  # 32x32 = 1024 threads per block
blocks = (32,32)   # 32x32 blocks of 32x32 threads = 1024x1024 threads, covering the 1000x1000 mesh
start = time()
LLA_distance_numba_cuda[blocks,threads](30.0,101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)
Zh = Z.copy_to_host()
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Zh)
$ nvprof python t40.py
==2855== NVPROF is profiling process 2855, command: python t40.py
1000000 total evaluations in 0.294 seconds
==2855== Profiling application: python t40.py
==2855== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   75.60%  10.364ms         1  10.364ms  10.364ms  10.364ms  cudapy::__main__::LLA_distance_numba_cuda$241(double, double, double, double, double, double, double, __int64, __int64, Array<double, int=2, A, mutable, aligned>)
                   24.40%  3.3446ms         1  3.3446ms  3.3446ms  3.3446ms  [CUDA memcpy DtoH]
(...)

Now we see that:

  1. The kernel runtime is even longer, at about 10ms (because we are doing the mesh generation in the kernel)
  2. There is no explicit copying of data from host to device
  3. The overall function runtime has been reduced from ~0.4s to ~0.3s
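A final note: the launch configuration above hard-codes blocks = (32,32), which happens to give a 1024x1024 grid of threads, just enough to cover the 1000x1000 mesh. A more general way to size the grid (a sketch, not part of the original code) is to round the block count up from the mesh dimensions:

import math
threads = (32, 32)
# enough blocks to cover all nx x ny mesh points, rounding up
blocks = (math.ceil(nx / threads[0]), math.ceil(ny / threads[1]))
LLA_distance_numba_cuda[blocks, threads](30.0, 101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)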
