Why can't I get the right sum of a 1D array with Numba (CUDA Python)?

I am trying to use CUDA Python with Numba. The code calculates the sum of a 1D array as follows, but I don't know how to get one result value rather than three values.

Python 3.5 with Numba + CUDA 8.0

import os,sys,time
import pandas as pd
import numpy as np
from numba import cuda, float32

os.environ['NUMBAPRO_NVVM']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = (1,1) 
tpb = (1,3) 

@cuda.jit
def calcu_sum(D,T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i += bh
    print('su:',su)
    T[0,0] = su
    print('T:',T[0,0])


D = np.array([ 0.42487645,0.41607881,0.42027071,0.43751907,0.43512794,0.43656972,
               0.43940639,0.43864551,0.43447691,0.43120232], dtype=np.float32)
T = np.empty([1,1])
print('D: ',D)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT = cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD, dT)

The output is:

D:  [ 0.42487645  0.41607881  0.42027071  0.43751907  0.43512794  0.43656972
  0.43940639  0.43864551  0.43447691  0.43120232]
su:  1.733004
su:  1.289852
su:  1.291317
T: 1.733004
T: 1.289852
T: 1.291317

Why can't I get the output "4.31417383" rather than "1.733004 1.289852 1.291317"? (1.733004 + 1.289852 + 1.291317 = 4.314173.)

I'm new to Numba and have read the Numba documentation, but I don't know how to do this. Can someone give advice?

The reason you don't get the sum you expect is that you haven't written code to produce that sum.

The basic CUDA programming model (whether you use CUDA C, Fortran, or Python as your language) is that you write kernel code which is executed by each thread. You have written code for each thread to read and sum part of the input array. You have not written any code for those threads to share their individual partial sums and combine them into a final sum.

There is an extremely well described algorithm for doing this -- it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships in the examples of every version of the CUDA toolkit, or download a presentation about it here. You can also read a more modern version of the algorithm, which uses newer features of CUDA (warp shuffle instructions and atomic transactions), here.
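For reference, a minimal sketch of the atomic flavour in Numba might look like the following (the kernel name sum_atomic, the grid-stride loop, and the 1D launch configuration are illustrative choices of mine, not taken from the linked material):

import numpy as np
from numba import cuda, float32

@cuda.jit
def sum_atomic(D, T):
    # grid-stride loop: each thread builds a private partial sum
    i = cuda.grid(1)
    stride = cuda.gridsize(1)
    su = float32(0.0)
    while i < D.shape[0]:
        su += D[i]
        i += stride
    # one atomic add per thread folds the partial sums into T[0]
    cuda.atomic.add(T, 0, su)

D = np.arange(10, dtype=np.float32)
T = np.zeros(1, dtype=np.float32)
sum_atomic[1, 32](D, T)  # Numba transfers D and T to and from the device
print(T[0])              # 45.0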

After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:

tpb = (1,3) 

@cuda.jit
def calcu_sum(D,T):

    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
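    # shared-memory buffer with one slot per thread in the block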
    sbuf = cuda.shared.array(tpb, float32)

    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i += bh

    print('su:',su)

    sbuf[0,ty] = su
    cuda.syncthreads()

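    # after the barrier, thread 0 serially combines the partial sums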
    if ty == 0:
        T[0,0] = 0
        for i in range(0, bh):
            T[0,0] += sbuf[0,i]
        print('T:',T[0,0])

will probably do what you want, although it is still a long way from an optimal parallel shared memory reduction, as you will see when you read the material I provided links to.
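Note also that the kernel only writes the result into the device copy of T; to read it on the host you need an explicit copy back. A sketch of the launch and copy-back, reusing D, T, bpg and tpb from above:

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT = cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD, dT)
    dT.copy_to_host(T, stream)  # bring the device result back into T

print('sum:', T[0, 0])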
