cuda python GPU numbapro 3d loop poor performance

I am trying to set up a 3D loop with the assignment

 C(i,j,k) = A(i,j,k) + B(i,j,k)

using Python on my GPU. This is my GPU:

http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications

The sources I'm looking at / comparing with are:

http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43

http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb

It's possible that I've imported more modules than necessary. This is my code:

import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *

@autojit
def myAdd(a, b):
  return a+b

myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)

@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z
    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd
    if i >= c.shape[0]:
      return
    if j >= c.shape[1]:
      return
    if k >= c.shape[2]:
      return
    for i in xrange(0,c.shape[0]):
      for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):
          # c[i,j,k] = a[i,j,k] + b[i,j,k]
          c[i,j,k] = myAdd_gpu(a[i,j,k],b[i,j,k])

def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8,2: 32,3: 192,}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per mutliprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores

    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))

    print "Threads per block:", thread_ct
    print "Block per grid:", block_ct

    a = np.ones((N,N,N), dtype = np.float32)
    b = np.ones((N,N,N), dtype = np.float32)
    c = np.zeros((N,N,N), dtype = np.float32)

    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a,b,cg)
    cg.to_host()
    dt = timer() - start
    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,:,:] = ' + str(c[:3,1,1])
    print 'c[-3:,:,:] = ' + str(c[-3:,1,1])


if __name__ == '__main__':
    main()

My result from running this is the following:

Running on GPU: GeForce GT 520
Compute capability:  2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per mutliprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Block per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,:,:] = [ 2.  2.  2.]
c[-3:,:,:] = [ 2.  2.  2.]

When I run the examples in the sources, I see significant speedup, so I don't think my example is running properly: the wall clock time is much longer than I would expect. I've modeled this mostly on the "even bigger speedups with cuda python" section in the first example link.
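For comparison, a plain NumPy version of the same element-wise add gives a CPU baseline to judge that wall clock time against (a minimal sketch of a timing harness; absolute timings depend on hardware):

```python
# Minimal CPU baseline for the same element-wise add C = A + B
# (hypothetical harness; absolute timings depend on hardware).
import numpy as np
from timeit import default_timer as timer

N = 100
a = np.ones((N, N, N), dtype=np.float32)
b = np.ones((N, N, N), dtype=np.float32)

start = timer()
c = a + b                      # vectorised element-wise add on the CPU
dt = timer() - start
print("Wall clock time with NumPy on CPU: %f s" % dt)
```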

I believe I've indexed correctly and safely. Maybe the problem is with my blockdim? Or griddim? Or maybe I'm using the wrong types for my GPU; I think I read that they must be a certain type. I'm very new to this, so the problem could well be trivial!

Any and all help is greatly appreciated!

You are creating your indexes correctly, but then you're ignoring them. Running the nested loop

for i in xrange(0,c.shape[0]):
    for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):

forces every thread to loop through all values in all dimensions, which is not what you want. You want each thread to compute one value in a block and then move on.
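To see the redundancy concretely, here is a pure-Python model of that launch (hypothetical tiny sizes, no GPU needed): every simulated thread executes the full triple loop, so the total number of stores is multiplied by the thread count.

```python
# Pure-Python model of the original kernel (no GPU needed, tiny sizes):
# every "thread" executes the full triple loop, so the N**3 stores are
# repeated once per thread instead of being divided among the threads.
N = 4            # tiny stand-in for the question's N = 100
threads = 8      # stand-in for block_ct * thread_ct

writes = 0
for _ in range(threads):       # each thread runs the same nested loops
    for i in range(N):
        for j in range(N):
            for k in range(N):
                writes += 1    # one redundant store per thread

print(writes)                  # threads * N**3 = 512, not N**3 = 64
```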

I think something like this should work better...

i = tx + bx * bw
while i < c.shape[0]:
    j = ty + by * bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i,j,k] = myAdd_gpu(a[i,j,k], b[i,j,k])
            k += cuda.blockDim.z * cuda.gridDim.z
        j += cuda.blockDim.y * cuda.gridDim.y
    i += cuda.blockDim.x * cuda.gridDim.x

Try to compile and run it. Also make sure to validate it, as I have not.
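As a partial sanity check (not a GPU run), a pure-Python model with hypothetical small sizes shows that a grid-stride loop like the above visits every (i, j, k) exactly once across all threads:

```python
# Pure-Python check of the grid-stride pattern (hypothetical sizes,
# runs without a GPU). Each simulated thread starts at its global
# index and strides by the total thread count in each dimension, so
# together the threads cover every (i, j, k) exactly once.
shape = (5, 4, 3)                      # stand-in for c.shape
grid_dim = (2, 2, 2)                   # blocks per grid, per dimension
block_dim = (2, 2, 1)                  # threads per block, per dimension

stride = tuple(g * b for g, b in zip(grid_dim, block_dim))
visits = {}                            # (i, j, k) -> visit count

for bx in range(grid_dim[0]):
    for by in range(grid_dim[1]):
        for bz in range(grid_dim[2]):
            for tx in range(block_dim[0]):
                for ty in range(block_dim[1]):
                    for tz in range(block_dim[2]):
                        i = tx + bx * block_dim[0]
                        while i < shape[0]:
                            j = ty + by * block_dim[1]
                            while j < shape[1]:
                                k = tz + bz * block_dim[2]
                                while k < shape[2]:
                                    visits[(i, j, k)] = visits.get((i, j, k), 0) + 1
                                    k += stride[2]
                                j += stride[1]
                            i += stride[0]

print(len(visits), set(visits.values()))   # 60 elements, each visited once
```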

I don't see you using imshow or show, so there is no need to import those.

It doesn't appear that you use your import of math (I didn't see any calls to math.some_function).

Your imports from numba and numbapro seem repetitive. Your "from numba import *" overrides your "from numbapro import cuda", since it comes after it, so your calls to cuda use the cuda in numba, not numbapro. Also, "from numba import *" imports everything from numba, not just cuda, which seems to be the only thing you use, and (I believe) "import numba.cuda" is equivalent to "from numba import cuda". Why not eliminate all your imports from numba and numbapro with a single "from numba import cuda"?
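The shadowing described here is ordinary Python name binding; a stdlib-only sketch of the same effect, using json and marshal as stand-ins for numbapro and numba:

```python
# Stdlib-only illustration of the shadowing described above: a later
# star-import silently rebinds a name from an earlier explicit import,
# just as "from numba import *" rebinds the cuda imported from numbapro.
from json import dumps     # explicit import ("from numbapro import cuda")
from marshal import *      # star import    ("from numba import *")

import marshal
print(dumps is marshal.dumps)   # True: json's dumps has been shadowed
```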
