
Numba - Shared memory in CUDA kernel not updating correctly

Consider the following kernel, which counts the number of elements in x that are less than or equal to the corresponding element in y.

@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    shared = cuda.shared.array(1, dtype=DTYPE)
    if i < len(x):
        shared[0] += x[i] <= y[i]
    cuda.syncthreads()
    out[0] = shared[0]
    

However, the increments from each thread are not being saved properly in the shared array.

a = cuda.to_device(np.arange(5))  # [0 1 2 3 4]
b = cuda.to_device(np.arange(5))  # [0 1 2 3 4]
out = cuda.to_device(np.zeros(1)) # [0]
count_leq[1,len(a)](a, b, out)
print(out[0])                     # 1.0, but should be 5.0

What am I doing wrong here? I'm confused because cuda.shared.array is shared by all threads in a given block, right? How do I accumulate the increments using the same 1-element array?

I also tried the following, which exhibited the same incorrect behavior as the version above.

@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    if i < len(x):
        out[0] += x[i] <= y[i]

You need to perform an atomic add operation explicitly:

@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    if i < len(x):
        cuda.atomic.add(out, 0, x[i] <= y[i])
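As a quick host-side sanity check, the value the kernel should produce can be computed with NumPy on the example arrays from the question (this is just the reference computation, not the kernel itself):

```python
import numpy as np

# Host-side reference for what count_leq computes:
# the number of positions where x[i] <= y[i].
a = np.arange(5)
b = np.arange(5)
expected = int(np.sum(a <= b))
print(expected)  # 5
```

With the atomic add in place, out[0] should match this value instead of reflecting only one thread's contribution.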

Atomic adds are optimized on relatively recent devices using, for example, a hardware warp-level reduction, but the operation tends not to scale when a large number of streaming multiprocessors perform atomic operations on the same location.

One solution to increase the performance of this kernel is to reduce many values per thread, assuming the array is large enough. In practice, each thread can sum multiple items and perform a single atomic operation at the end. The code should look like this (untested):

# Must be launched with different parameters, since
# each thread works on multiple array items:
# the number of blocks should be 16 times smaller.
@cuda.jit
def count_leq(x, y, out):
    tid = cuda.threadIdx.x
    bid = cuda.blockIdx.x
    bdim = cuda.blockDim.x
    i = (bid * bdim * 16) + tid

    s = 0

    # Fast general case (far from the end of the arrays)
    if i+16*bdim < len(x):
        # Thread-local reduction
        # This loop should be unrolled
        for j in range(16):
            idx = i + j * bdim
            s += x[idx] <= y[idx]

    # Slower corner case (close to end of the arrays: checks are needed)
    else:
        for j in range(16):
            idx = i + j * bdim
            if idx < len(x):
                s += x[idx] <= y[idx]

    cuda.atomic.add(out, 0, s)

Note that 16 is an arbitrary value. A bigger value like 64 is certainly faster for huge arrays, while a smaller value is better for relatively small arrays.
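To convince yourself that the strided indexing in this kernel touches every array element exactly once, the index pattern can be simulated on the host. This is a sketch with hypothetical block size and array length; only the indexing scheme comes from the kernel above:

```python
# Pure-Python simulation of the kernel's index pattern:
# thread tid in block bid starts at i = bid*bdim*16 + tid and
# reads items i + j*bdim for j in 0..15. Every index in [0, n)
# should be visited exactly once when the grid is sized accordingly.
ITEMS_PER_THREAD = 16
bdim = 32        # hypothetical threads per block
n = 1000         # hypothetical array length
nblocks = (n + bdim * ITEMS_PER_THREAD - 1) // (bdim * ITEMS_PER_THREAD)

visited = []
for bid in range(nblocks):
    for tid in range(bdim):
        i = bid * bdim * ITEMS_PER_THREAD + tid
        for j in range(ITEMS_PER_THREAD):
            idx = i + j * bdim
            if idx < n:
                visited.append(idx)

# Each index is covered exactly once, with no gaps or duplicates.
assert sorted(visited) == list(range(n))
```

The stride of bdim between a thread's consecutive items keeps the loads of a warp coalesced, which is why the kernel strides by the block dimension rather than giving each thread 16 contiguous items.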
