I want to evaluate a function at every point in a mesh. The trouble is, if I create the mesh on the CPU side, the act of transferring it to the GPU takes longer than the actual calculations. Can I generate the mesh on the GPU side?
The code below creates the mesh on the CPU side and evaluates most of the expression on the GPU side (I wasn't sure how to get atan2 to work on the GPU, so I left it on the CPU side). I should apologize in advance: I'm still learning this stuff, so I'm sure there's a lot of room for improvement in the code below!
Thanks!
import math
from numba import vectorize, float64
import numpy as np
from time import time
@vectorize([float64(float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2):
    return (math.sin(0.008726645 * (lat2 - lat1))**2) + \
        math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    a = a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1),
               np.ascontiguousarray(lat2), np.ascontiguousarray(lon2))
    return earthdiam_nm * np.arctan2(a,1-a)
# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
start = time()
LLA_distance_numba_cuda(X,Y,X2,Y2)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
Let's establish a performance baseline. Adding a definition (1.0) for earthdiam_nm, and running your code under nvprof, we have:
$ nvprof python t38.py
1000000 total evaluations in 0.581 seconds
(...)
==1973== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 55.58% 11.418ms 4 2.8544ms 2.6974ms 3.3044ms [CUDA memcpy HtoD]
28.59% 5.8727ms 1 5.8727ms 5.8727ms 5.8727ms cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
15.83% 3.2521ms 1 3.2521ms 3.2521ms 3.2521ms [CUDA memcpy DtoH]
(...)
So on my particular setup, the "kernel" itself runs in ~5.8ms on my (small, slow) Quadro K2000 GPU, and the data copy times are a total of 11.4ms for the 4 copies from host to device and 3.2ms for the results transfer back to host. Those 4 host-to-device copies are what we'll focus on.
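To put those copy times in context, here is a back-of-the-envelope calculation of how much data the four host-to-device copies move. It assumes each input is a flat array of one million float64 values (as in the question's code) and takes the average per-copy time from the nvprof output above; the variable names are only for illustration.

```python
# Rough transfer-size arithmetic for the baseline run (illustrative names).
n_points = 1000 * 1000          # nx * ny evaluation points
bytes_per_value = 8             # float64
n_input_arrays = 4              # lat1, lon1, lat2, lon2

bytes_per_array = n_points * bytes_per_value
total_htod_bytes = n_input_arrays * bytes_per_array

avg_copy_seconds = 2.8544e-3    # "Avg" column for [CUDA memcpy HtoD] above
bandwidth_gb_per_s = bytes_per_array / avg_copy_seconds / 1e9

print('{:.0f} MB per array, {:.0f} MB total'.format(
    bytes_per_array / 1e6, total_htod_bytes / 1e6))
print('effective HtoD bandwidth: {:.1f} GB/s'.format(bandwidth_gb_per_s))
```

That works out to 32 MB moved to the device at roughly 2.8 GB/s, so any input array we can avoid sending saves a noticeable share of the total.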
Let's go after the low-hanging fruit first. This line of code:
X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
isn't really doing anything other than passing the values 30 and 101 to each "worker". I'm using "worker" here to refer to the idea of a particular scalar computation in the numba process of "broadcasting" the vectorize function across a large data set. The numba vectorize/broadcast process doesn't require that each and every input be a data set of the same size, merely that the supplied data is "broadcast"-able. So it's possible to create a vectorize ufunc that works on an array and a scalar, for example. That means each worker will use its element of the array, plus the scalar, to perform its computation.
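The same broadcasting rule can be seen on the CPU with plain NumPy, which is a cheap way to convince yourself before touching the GPU code. The sketch below is a stand-in, not the numba code path: `a_numpy` is a hypothetical NumPy port of the expression in a_cuda, evaluated once with full constant arrays and once with scalars.

```python
import numpy as np

def a_numpy(lat1, lon1, lat2, lon2):
    # Hypothetical NumPy port of the a_cuda expression (CPU only).
    return (np.sin(0.008726645 * (lat2 - lat1)) ** 2
            + np.cos(0.01745329 * lat1) * np.cos(0.01745329 * lat2)
            * np.sin(0.008726645 * (lon2 - lon1)) ** 2)

n = 1000
lat1 = np.linspace(29, 31, n)
lon1 = np.linspace(99, 101, n)

# Variant 1: constant inputs materialized as full arrays.
full = a_numpy(lat1, lon1, np.full(n, 30.0), np.full(n, 101.0))

# Variant 2: the same constants passed as scalars and broadcast.
scalar = a_numpy(lat1, lon1, 30.0, 101.0)

# Both variants should agree to within floating-point tolerance.
print(np.allclose(full, scalar))
```

The scalar variant never builds the two constant arrays, which on the GPU side means they never have to be copied either.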
Therefore the low-hanging fruit is to simply remove these two arrays and pass the values (30, 101) as scalars to the ufunc a_cuda. While we are going after "low hanging fruit", let's incorporate your arctan2 calculation (replacing it with math.atan2) and your final scaling by earthdiam_nm into the vectorize code, so we don't have to do it on the host in python/numpy:
$ cat t39.py
import math
from numba import vectorize, float64
import numpy as np
from time import time
earthdiam_nm = 1.0
@vectorize([float64(float64,float64,float64,float64,float64)],target='cuda')
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
        math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s
def LLA_distance_numba_cuda(lat1, lon1, lat2, lon2):
    return a_cuda(np.ascontiguousarray(lat1), np.ascontiguousarray(lon1),
                  np.ascontiguousarray(lat2), np.ascontiguousarray(lon2), earthdiam_nm)
# generate a mesh of one million evaluation points
nx, ny = 1000,1000
xv, yv = np.meshgrid(np.linspace(29, 31, nx), np.linspace(99, 101, ny))
X, Y = np.float64(xv.reshape(1,nx*ny).flatten()), np.float64(yv.reshape(1,nx*ny).flatten())
# X2,Y2 = np.float64(np.array([30]*nx*ny)),np.float64(np.array([101]*nx*ny))
start = time()
Z=LLA_distance_numba_cuda(X,Y,30.0,101.0)
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Z)
$ nvprof python t39.py
==2387== NVPROF is profiling process 2387, command: python t39.py
1000000 total evaluations in 0.401 seconds
==2387== Profiling application: python t39.py
==2387== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 48.12% 8.4679ms 1 8.4679ms 8.4679ms 8.4679ms cudapy::__main__::__vectorized_a_cuda$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>)
33.97% 5.9774ms 5 1.1955ms 864ns 3.2535ms [CUDA memcpy HtoD]
17.91% 3.1511ms 4 787.77us 1.1840us 3.1459ms [CUDA memcpy DtoH]
(snip)
Now we see that the copy HtoD operations have been reduced from 11.4ms total to ~6.0ms total (the two large arrays plus three tiny scalar copies). The kernel has grown from ~5.8ms to ~8.5ms because we are doing more work in the kernel, but the python reported execution time for the function has dropped from ~0.58s to ~0.4s.
Can we do better?
We can, but in order to do so (I believe) we'll need to use a different numba cuda method. The vectorize method is convenient for scalar element-wise operations, but it has no way to know where in the overall data set the operation is being carried out. We need this information, and we can get it in the CUDA code, but we will need to switch to the @cuda.jit decorator to do so.
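As a rough illustration of what "knowing where you are" means, the pure-Python sketch below mimics how a 2D global index is derived from block and thread indices, which is the position that cuda.grid(2) hands back inside a kernel. The helper `global_index_2d` is hypothetical and runs on the CPU; it only shows the arithmetic.

```python
def global_index_2d(block_idx, block_dim, thread_idx):
    # Hypothetical CPU stand-in for the arithmetic behind cuda.grid(2):
    # global position = block index * block size + thread index, per axis.
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    return x, y

block_dim = (32, 32)   # threads per block

# First thread of the first block sits at the origin of the grid.
print(global_index_2d((0, 0), block_dim, (0, 0)))
# Thread (5, 7) of block (2, 3) lands at (2*32 + 5, 3*32 + 7).
print(global_index_2d((2, 3), block_dim, (5, 7)))
```

Because every thread gets a distinct (x, y) pair, each one can compute its own mesh coordinates from a handful of scalars instead of reading them from an array that was copied over.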
The following code converts the previous vectorize a_cuda function into a @cuda.jit device function (with essentially no other changes), and then we create a CUDA kernel that does the mesh generation according to the supplied scalar parameters, and computes the result:
$ cat t40.py
import math
from numba import vectorize, float64, cuda
import numpy as np
from time import time
earthdiam_nm = 1.0
@cuda.jit(device=True)
def a_cuda(lat1, lon1, lat2, lon2, s):
    a = (math.sin(0.008726645 * (lat2 - lat1))**2) + \
        math.cos(0.01745329*(lat1)) * math.cos(0.01745329*(lat2)) * (math.sin(0.008726645 * (lon2 - lon1))**2)
    return math.atan2(a, 1-a)*s
@cuda.jit
def LLA_distance_numba_cuda(lat2, lon2, xb, xe, yb, ye, s, nx, ny, out):
    x,y = cuda.grid(2)
    if x < nx and y < ny:
        lat1 = (((xe-xb) * x)/(nx-1)) + xb # mesh generation
        lon1 = (((ye-yb) * y)/(ny-1)) + yb # mesh generation
        out[y][x] = a_cuda(lat1, lon1, lat2, lon2, s)
nx, ny = 1000,1000
Z = cuda.device_array((ny,nx), dtype=np.float64)  # kernel writes out[y][x], so rows are y
threads = (32,32)
blocks = (32,32)
start = time()
LLA_distance_numba_cuda[blocks,threads](30.0,101.0, 29.0, 31.0, 99.0, 101.0, earthdiam_nm, nx, ny, Z)
Zh = Z.copy_to_host()
print('{:d} total evaluations in {:.3f} seconds'.format(nx*ny,time()-start))
#print(Zh)
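If you want to sanity-check the in-kernel mesh formula without a GPU, a NumPy sketch like the following compares it against np.linspace/np.meshgrid. It also derives the block count from the mesh size by ceiling division rather than hard-coding it; `blocks_for` is a hypothetical helper, and the small nx, ny are chosen unequal on purpose so that a row/column mix-up would show up.

```python
import math
import numpy as np

def blocks_for(shape, threads):
    # Hypothetical helper: enough blocks per axis to cover `shape`.
    return tuple(math.ceil(n / t) for n, t in zip(shape, threads))

nx, ny = 7, 5
xb, xe, yb, ye = 29.0, 31.0, 99.0, 101.0

# Same arithmetic as the kernel, applied to every index at once.
x = np.arange(nx)
y = np.arange(ny)
lat1 = ((xe - xb) * x) / (nx - 1) + xb
lon1 = ((ye - yb) * y) / (ny - 1) + yb

# Reference mesh built the way the original host code did it.
xv, yv = np.meshgrid(np.linspace(xb, xe, nx), np.linspace(yb, ye, ny))

# out[y][x] in the kernel corresponds to row y, column x of the meshgrid,
# so the matching output array has shape (ny, nx).
print(xv.shape == (ny, nx))
print(np.allclose(xv, np.broadcast_to(lat1, (ny, nx))))
print(np.allclose(yv, np.broadcast_to(lon1[:, None], (ny, nx))))

# Block count needed for a 1000 x 1000 mesh with 32 x 32 threads per block.
print(blocks_for((1000, 1000), (32, 32)))
```

For the 1000 x 1000 mesh this gives 32 blocks per axis, i.e. 1024 threads per axis, which is why the kernel needs the `x < nx and y < ny` guard.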
$ nvprof python t40.py
==2855== NVPROF is profiling process 2855, command: python t40.py
1000000 total evaluations in 0.294 seconds
==2855== Profiling application: python t40.py
==2855== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 75.60% 10.364ms 1 10.364ms 10.364ms 10.364ms cudapy::__main__::LLA_distance_numba_cuda$241(double, double, double, double, double, double, double, __int64, __int64, Array<double, int=2, A, mutable, aligned>)
24.40% 3.3446ms 1 3.3446ms 3.3446ms 3.3446ms [CUDA memcpy DtoH]
(...)
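Now we see that the HtoD copy operations are gone entirely, the kernel time has grown from ~8.5ms to ~10.4ms because the kernel now also generates the mesh, and the python reported execution time has dropped from ~0.4s to ~0.29s.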