Converting function to NumbaPro CUDA
I am comparing several Python modules/extensions or methods for achieving the following:
import numpy as np

def fdtd(input_grid, steps):
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)

    l_x = grid.shape[0]
    l_y = grid.shape[1]

    for i in range(steps):
        np.copyto(previous_grid, old_grid)
        np.copyto(old_grid, grid)

        for x in range(l_x):
            for y in range(l_y):
                grid[x,y] = 0.0
                if 0 < x+1 < l_x:
                    grid[x,y] += old_grid[x+1,y]
                if 0 < x-1 < l_x:
                    grid[x,y] += old_grid[x-1,y]
                if 0 < y+1 < l_y:
                    grid[x,y] += old_grid[x,y+1]
                if 0 < y-1 < l_y:
                    grid[x,y] += old_grid[x,y-1]
                grid[x,y] /= 2.0
                grid[x,y] -= previous_grid[x,y]

    return grid
This function is a very basic implementation of the Finite-Difference Time Domain (FDTD) method. I've implemented this function in several ways.
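The list of implementations is not reproduced here, but as an illustration of one alternative, here is a vectorized pure-NumPy variant (my own sketch, not necessarily one of the implementations from the original comparison). The slices mirror the original boundary tests exactly, including the `0 < x-1` check, which skips the neighbour at index 0:

```python
import numpy as np

def fdtd_vectorized(input_grid, steps):
    """Same update rule as the loop version, with the inner x/y
    loops replaced by array slicing."""
    grid = input_grid.copy()
    old_grid = np.zeros_like(input_grid)
    previous_grid = np.zeros_like(input_grid)

    for _ in range(steps):
        previous_grid[:, :] = old_grid
        old_grid[:, :] = grid

        grid[:, :] = 0.0
        grid[:-1, :] += old_grid[1:, :]    # x+1 < l_x
        grid[2:, :]  += old_grid[1:-1, :]  # 0 < x-1
        grid[:, :-1] += old_grid[:, 1:]    # y+1 < l_y
        grid[:, 2:]  += old_grid[:, 1:-1]  # 0 < y-1
        grid /= 2.0
        grid -= previous_grid

    return grid
```

On large grids this removes the Python-level inner loops entirely, which is usually a bigger win than JIT-compiling the scalar version.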
Now I would like to compare the performance with NumbaPro CUDA.
This is the first time I am writing code for CUDA, and I came up with the code below.
from numbapro import cuda, float32, int16
import numpy as np

@cuda.jit(argtypes=(float32[:,:], float32[:,:], float32[:,:], int16, int16, int16))
def kernel(grid, old_grid, previous_grid, steps, l_x, l_y):
    x, y = cuda.grid(2)

    for i in range(steps):
        previous_grid[x,y] = old_grid[x,y]
        old_grid[x,y] = grid[x,y]

    for i in range(steps):
        grid[x,y] = 0.0
        if 0 < x+1 and x+1 < l_x:
            grid[x,y] += old_grid[x+1,y]
        if 0 < x-1 and x-1 < l_x:
            grid[x,y] += old_grid[x-1,y]
        if 0 < y+1 and y+1 < l_x:
            grid[x,y] += old_grid[x,y+1]
        if 0 < y-1 and y-1 < l_x:
            grid[x,y] += old_grid[x,y-1]
        grid[x,y] /= 2.0
        grid[x,y] -= previous_grid[x,y]

def fdtd(input_grid, steps):
    grid = cuda.to_device(input_grid)
    old_grid = cuda.to_device(np.zeros_like(input_grid))
    previous_grid = cuda.to_device(np.zeros_like(input_grid))

    l_x = input_grid.shape[0]
    l_y = input_grid.shape[1]

    kernel[(16,16),(32,8)](grid, old_grid, previous_grid, steps, l_x, l_y)

    return grid.copy_to_host()
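One detail of the wrapper worth flagging (as an observation about coverage, not a confirmed cause of the crash below): the launch configuration `kernel[(16,16),(32,8)]` is hard-coded to a 16x16 grid of blocks with 32x8 threads each, i.e. 512x128 threads in total, so any array larger than that is only partially processed. A covering configuration can be computed with ceiling division; the helper below is a hypothetical sketch, not part of the original code:

```python
def covering_launch(shape, block=(32, 8)):
    """Blocks-per-grid so that blocks * threads-per-block covers shape.

    Ceiling division: (n + b - 1) // b blocks of b threads cover at
    least n elements in each dimension.
    """
    bx, by = block
    blocks = ((shape[0] + bx - 1) // bx, (shape[1] + by - 1) // by)
    return blocks, block

# e.g. for a 1000x1000 array:
blocks, threads = covering_launch((1000, 1000))
# kernel[blocks, threads](...) would then cover every cell; the kernel
# itself must still guard against x >= l_x or y >= l_y, because the
# launch may overshoot the array bounds.
```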
Unfortunately I get the following error:
File ".../fdtd_numbapro.py", line 98, in fdtd
return grid.copy_to_host()
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/devicearray.py", line 142, in copy_to_host
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 1702, in device_to_host
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/cudadrv/driver.py", line 772, in check_error
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED
Failed to copy memory D->H
I've used grid.to_host() as well, and that did not work either. CUDA is definitely working with NumbaPro on this system.
The problem was resolved by the user. I am cross-referencing the discussion of this problem on the Anaconda mailing list: https://groups.google.com/a/continuum.io/forum/#!searchin/anaconda/fdtd/anaconda/VgiN4h37UrA/18tAc60EIkcJ
I made some minor modifications to your original code to get it running in Parakeet:
1) Split compound comparisons such as "0 < x-1 < l_x" into "0 < x-1 and x-1 < l_x".
2) Replaced np.copyto with explicit indexed assignment (previous_grid[:,:] = old_grid).
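Both rewrites are behaviour-preserving; a quick check of each, using made-up example values:

```python
import numpy as np

# 1) A chained comparison and its split form are equivalent.
x, l_x = 3, 5
assert (0 < x - 1 < l_x) == (0 < x - 1 and x - 1 < l_x)

# 2) Sliced assignment copies element-wise into the existing buffer,
#    just as np.copyto does, rather than rebinding the name.
a = np.arange(6.0).reshape(2, 3)
b = np.zeros_like(a)
c = np.zeros_like(a)
np.copyto(b, a)
c[:, :] = a
assert (b == c).all()
assert c is not a  # still a separate buffer, not an alias
```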
After that, I compared the Parakeet runtimes for the C, OpenMP, and CUDA backends against the original Python time and Numba's autojit on a 1000x1000 grid with steps = 20.
Parakeet (backend = c) cold: fdtd : 0.5590s
Parakeet (backend = c) warm: fdtd : 0.1260s
Parakeet (backend = openmp) cold: fdtd : 0.4317s
Parakeet (backend = openmp) warm: fdtd : 0.1693s
Parakeet (backend = cuda) cold: fdtd : 2.6357s
Parakeet (backend = cuda) warm: fdtd : 0.2455s
Numba (autojit) cold: 672.3666s
Numba (autojit) warm: 657.8858s
Python: 203.3907s
Since there is little readily available parallelism in your code, the parallel backends actually do worse than the sequential one. This is largely due to differences in which loop optimizations Parakeet runs for each backend, along with some extra overhead from CUDA memory transfers and from starting OpenMP thread groups. I'm not sure why Numba's autojit is so slow here; I'm sure it would be faster with type annotations or with NumbaPro.