Anaconda's NumbaPro CUDA Assertion Error

I am trying to use NumbaPro's CUDA extension to multiply large matrices. What I want in the end is to multiply a matrix of size NxN by a diagonal matrix that is fed in as a 1D array (thus, a.dot(numpy.diagflat(b)), which I have found to be equivalent to a * b). However, I am getting an assertion error that provides no information.

I can only avoid this assertion error if I multiply two 1D arrays, but that is not what I want to do.
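For reference, here is a quick NumPy check of the equivalence mentioned above; the small arrays here are illustrative only:

import numpy as np

a = np.random.sample((4, 4))
b = np.random.sample(4)

# Scaling the columns of a by b: explicit diagonal matrix vs. broadcasting
print(np.allclose(a.dot(np.diagflat(b)), a * b))  # prints True

Here is the code that reproduces the error: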

from numbapro import vectorize, cuda
from numba import f4,f8
import numpy as np

def generate_input(n):
    import numpy as np
    A = np.array(np.random.sample((n,n)))
    B = np.array(np.random.sample(n) + 10)
    return A, B

def product(a, b):
    return a * b

def main():
    cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)

    N = 1000

    A, B = generate_input(N)
    D = np.empty(A.shape)

    stream = cuda.stream()

    with stream.auto_synchronize():
        dA = cuda.to_device(A, stream)
        dB = cuda.to_device(B, stream)
        dD = cuda.to_device(D, stream, copy=False)
        cu_product(dA, dB, out=dD, stream=stream)
        dD.to_host(stream)

if __name__ == '__main__':
    main()

This is what my terminal spits out:

Traceback (most recent call last):
  File "cuda_vectorize.py", line 32, in <module>
    main()
  File "cuda_vectorize.py", line 28, in main
    cu_product(dA, dB, out=dD, stream=stream)
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
  File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
AssertionError

The problem is that you are using vectorize on a function that takes non-scalar arguments. The idea behind NumbaPro's vectorize is that it takes a scalar function as input and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
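As a minimal sketch of what vectorize expects (written here with the open-source numba.vectorize decorator, which follows the same pattern; the function and array names are illustrative):

from numba import vectorize
import numpy as np

@vectorize(['float32(float32, float32)', 'float64(float64, float64)'])
def scalar_product(a, b):
    # a and b are single elements here; the generated ufunc broadcasts
    # this scalar operation over entire arrays.
    return a * b

x = np.random.sample(1000)
y = np.random.sample(1000)
z = scalar_product(x, y)  # elementwise product, same shape as x and y

A matrix times a vector does not fit this scalar model directly, which is why the call in the question fails its argument check.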

Your function takes a matrix and a vector, which are definitely not scalars. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note that this will need NumbaPro 0.12.2 or later (just released as of this edit).

from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer

def generate_input(n):
    A = np.array(np.random.sample((n,n)), dtype=np.float32)
    B = np.array(np.random.sample(n), dtype=A.dtype)
    return A, B

@cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
    # Grid-stride loop: each thread starts at its global (x, y) index and
    # steps by the total grid width/height, so any grid/block shape covers
    # the whole matrix.
    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    height, width = c.shape

    for y in range(startY, height, gridY):
        for x in range(startX, width, gridX):
            # c = a * diag(b): scale column x of a by b[x]
            c[y, x] = a[y, x] * b[x]

def main():

    N = 1000

    A, B = generate_input(N)
    D = np.empty(A.shape, dtype=A.dtype)
    E = np.zeros(A.shape, dtype=A.dtype)
    F = np.empty(A.shape, dtype=A.dtype)

    start = timer()
    E = np.dot(A, np.diag(B))
    numpy_time = timer() - start

    blas = cublas.api.Blas()

    start = timer()
    blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
    cublas_time = timer() - start

    diff = np.abs(D-E)
    print("Maximum CUBLAS error %f" % np.max(diff))

    blockdim = (32, 8)
    griddim  = (16, 16)

    start = timer()
    dA = cuda.to_device(A)
    dB = cuda.to_device(B)
    dF = cuda.to_device(F, copy=False)
    diagproduct[griddim, blockdim](dF, dA, dB)
    dF.to_host()
    cuda_time = timer() - start   

    diff = np.abs(F-E)
    print("Maximum CUDA error %f" % np.max(diff))

    print("Numpy took    %f seconds" % numpy_time)
    print("CUBLAS took   %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time)) 
    print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))

if __name__ == '__main__':
    main()

The custom kernel is significantly faster because SGEMM does a full matrix-matrix multiply, which is O(n^3) work and requires expanding the diagonal into a full matrix. The diagproduct function is smarter: it does a single multiply for each matrix element, O(n^2) work, and never expands the diagonal into a full matrix (for N = 1000 that is on the order of 10^9 multiply-adds versus 10^6 multiplies). Here are the results on my NVIDIA Tesla K20c GPU for N = 1000:

Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took    0.024535 seconds
CUBLAS took   0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup

The timing includes all of the copies to and from the GPU, which are a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:

Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took    7.245677 seconds
CUBLAS took   1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
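As noted above, these timings include the host-device copies. A minimal variation of the timing block (reusing the same names and API calls from the example; illustrative, not a complete program) hoists the copies out of the timed region to isolate the kernel cost:

# Copies moved outside the timed region
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dF = cuda.to_device(F, copy=False)

start = timer()
diagproduct[griddim, blockdim](dF, dA, dB)
dF.to_host()  # the copy back also waits for the kernel to finish
kernel_time = timer() - start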

For very small matrices, however, CUBLAS SGEMM has an optimized path, so it is closer to the CUDA kernel's performance. Here, N = 100:

Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took    0.006876 seconds
CUBLAS took   0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup

Just to bounce back on all those considerations: I also wanted to implement some matrix computations on CUDA, but then I heard about the numpy.einsum function. It turns out that einsum is incredibly fast. For a case like this, here is the code for it, but it can be applied to many other types of computations.

G = np.einsum('ij,j->ij', A, B)
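As a quick sanity check (illustrative, reusing A and B from above), the einsum result matches both the broadcasted product and the explicit diagonal product:

print(np.allclose(G, A * B))              # True
print(np.allclose(G, A.dot(np.diag(B))))  # True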

In terms of speed, here are the results for N = 10000:

Numpy took    8.387756 seconds
CUDA JIT took 0.218394 seconds, 38.41x speedup
EINSUM took 0.131751 seconds, 63.66x speedup
