Numba CUDA: why is the sum of the 1D array not right?

I am practicing Numba and CUDA programming. I tried to sum an array of ones with CUDA, but the sum is not correct. I think the problem has something to do with synchronization and with collecting the data correctly at the end.

from numba import cuda, float32
import numpy
import math

@cuda.jit
def my_kernel(const_array, res_array):

    sbuf = cuda.shared.array(512, float32)

    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw

    sbuf[tx] = 0

    if pos < const_array.shape[0]:

        sbuf[tx] = const_array[pos] # do the computation

    cuda.syncthreads()
    if cuda.threadIdx.x == 0:
        for i in range(bw):
            res_array[0] += sbuf[i] 


    return


data_size = 10000000
res = numpy.zeros(1, dtype=numpy.float64)
const_array = numpy.ones(data_size, dtype=numpy.int8)

threadsperblock = 512
blockspergrid = math.ceil(data_size / threadsperblock)

my_kernel[blockspergrid, threadsperblock](const_array, res)

print(res)        

Every time I run this code it returns a different value, e.g. 28160.0, but of course it should be 10 million.

Any hint?

The problem seems to be that you are not summing across the whole set of blocks. You have a vector of dimension 10000000 and 512 threads per block, which means you need to sum over all 19532 blocks. In standard CUDA this is achieved either by launching multiple kernels (mostly on older devices) or by using atomic operations. Specifically, your problem is in this part of your code:

if pos < const_array.shape[0]:
    sbuf[tx] = const_array[pos] # do the computation

cuda.syncthreads()
if cuda.threadIdx.x == 0:
    for i in range(bw):
        res_array[0] += sbuf[i]

In the first two lines, you copy the data from global memory into the shared array sbuf. But then threads in all the different blocks simultaneously try to add their local data to the same global memory address, res_array[0]. These read-modify-write operations are not sequential: different threads may read the same old value and overwrite each other's updates, giving you wrong results. The solution is to first perform a partial sum in shared memory and then use an atomic add, which avoids the unsynchronized read-write operations:

if cuda.threadIdx.x == 0:
    # reduce this block's data to a single partial sum...
    sum = 0
    for i in range(bw):
        sum += sbuf[i]
    # ...and publish it with one atomic add per block
    cuda.atomic.add(res_array, 0, sum)
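
For completeness, here is what the whole kernel looks like with this fix applied (a minimal sketch; the launch code is unchanged from the question, and the expected output is shown as a comment):

from numba import cuda, float32
import numpy
import math

@cuda.jit
def my_kernel(const_array, res_array):
    # shared buffer, one slot per thread in the block
    sbuf = cuda.shared.array(512, float32)

    tx = cuda.threadIdx.x
    pos = cuda.grid(1)  # flattened global index

    # stage one element per thread into shared memory
    sbuf[tx] = 0
    if pos < const_array.shape[0]:
        sbuf[tx] = const_array[pos]
    cuda.syncthreads()

    # one thread per block reduces the block's data, then a
    # single atomic add per block publishes the partial sum
    if tx == 0:
        block_sum = 0.
        for i in range(cuda.blockDim.x):
            block_sum += sbuf[i]
        cuda.atomic.add(res_array, 0, block_sum)

data_size = 10000000
res = numpy.zeros(1, dtype=numpy.float64)
const_array = numpy.ones(data_size, dtype=numpy.int8)

threadsperblock = 512
blockspergrid = math.ceil(data_size / threadsperblock)
my_kernel[blockspergrid, threadsperblock](const_array, res)

print(res)  # expected: [10000000.]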

That should solve your problem.

Regards.

Firstly, the summing logic is very inefficient: a single thread per block loops serially over all 512 shared-memory elements. More importantly, you are trying to write to a single memory location from threads in different blocks, which results in a race condition. You should use cuda.atomic.add to avoid the race condition. You can read more in the CUDA programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
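
To address the inefficiency as well, the usual pattern is a shared-memory tree reduction: the number of active threads is halved at each step instead of having one thread loop over all 512 elements. Here is a minimal sketch of that approach (the kernel name sum_reduce and the hard-coded 512-thread block size are my assumptions, not from the original post); it is launched exactly like my_kernel above:

from numba import cuda, float32

@cuda.jit
def sum_reduce(const_array, res_array):
    sbuf = cuda.shared.array(512, float32)
    tx = cuda.threadIdx.x
    pos = cuda.grid(1)

    # stage one element per thread into shared memory
    if pos < const_array.shape[0]:
        sbuf[tx] = const_array[pos]
    else:
        sbuf[tx] = 0
    cuda.syncthreads()

    # tree reduction: at each step the first half of the active
    # threads adds in the second half's values
    stride = 256  # half of the 512-thread block
    while stride > 0:
        if tx < stride:
            sbuf[tx] += sbuf[tx + stride]
        cuda.syncthreads()
        stride //= 2

    # one atomic add per block instead of a serial loop
    if tx == 0:
        cuda.atomic.add(res_array, 0, sbuf[0])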
