I am practicing Numba CUDA programming. I tried to sum an array of ones on the GPU, but the result is wrong. I suspect the problem is in synchronizing the threads and collecting the partial results correctly at the end.
import math
import numpy
from numba import cuda, float32

@cuda.jit
def my_kernel(const_array, res_array):
    sbuf = cuda.shared.array(512, float32)
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + ty * bw
    sbuf[tx] = 0
    if pos < const_array.shape[0]:
        sbuf[tx] = const_array[pos]  # do the computation
    cuda.syncthreads()
    if cuda.threadIdx.x == 0:
        for i in range(bw):
            res_array[0] += sbuf[i]

data_size = 10000000
res = numpy.zeros(1, dtype=numpy.float64)
const_array = numpy.ones(data_size, dtype=numpy.int8)
threadsperblock = 512
blockspergrid = math.ceil(data_size / threadsperblock)
my_kernel[blockspergrid, threadsperblock](const_array, res)
print(res)
Every time I run this code it returns a different value, e.g. 28160.0, but of course it should be 10000000.0.
Any hint?
The problem seems to be that you are not summing across the whole set of blocks. You have a vector of 10000000 elements and 512 threads per block, which means you need to combine the partial results of all 19532 blocks. In standard CUDA this is achieved either by launching multiple kernels (mostly on older devices) or by using atomic operations. Specifically, the problem is in this part of your code:
if pos < const_array.shape[0]:
    sbuf[tx] = const_array[pos]  # do the computation
cuda.syncthreads()
if cuda.threadIdx.x == 0:
    for i in range(bw):
        res_array[0] += sbuf[i]
In the first two lines, you copy the data from global memory into the shared array sbuf. But then thread 0 of every block simultaneously tries to add its block's data into the single global memory location res_array[0]. These read-modify-write operations are not serialized, so different blocks can read the same intermediate value and overwrite each other's updates, giving wrong results. The solution is to first compute a partial sum in shared memory, then add it to the global result with an atomic operation, which serializes the concurrent reads and writes:
if cuda.threadIdx.x == 0:
    block_sum = 0
    for i in range(bw):
        block_sum += sbuf[i]
    cuda.atomic.add(res_array, 0, block_sum)
That should solve your problem.
Regards.
First, the summing logic is very inefficient, but the actual bug is that you are writing to a single memory location from threads in different blocks, which is a race condition. You should use cuda.atomic.add
to avoid the race condition. You can read more in the CUDA programming guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
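The same lost-update effect can be sketched on the CPU with plain Python threads (an analogy, not CUDA: `threading.Lock` here plays the role that cuda.atomic.add plays on the device). Each racy worker reads the shared cell and writes it back in two steps, so an update from another thread made in between is silently overwritten:

```python
import sys
import threading

sys.setswitchinterval(1e-6)  # switch threads very often to expose the race

N_THREADS, N_INCREMENTS = 4, 50_000
counter = [0]         # unsynchronized shared cell, like res_array[0] in the kernel
atomic_counter = [0]
lock = threading.Lock()

def racy_worker():
    for _ in range(N_INCREMENTS):
        v = counter[0]      # read ...
        counter[0] = v + 1  # ... write: updates made in between by other threads are lost

def atomic_worker():
    for _ in range(N_INCREMENTS):
        with lock:  # serializes the read-modify-write, like cuda.atomic.add
            atomic_counter[0] += 1

def run(worker):
    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run(racy_worker)
run(atomic_worker)

print("racy:  ", counter[0])         # typically less than 200000
print("atomic:", atomic_counter[0])  # always exactly 200000
```

Only the synchronized counter is guaranteed to reach N_THREADS * N_INCREMENTS; the racy one usually comes up short, which is exactly the behavior seen in the question.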