
Numba - Shared memory in CUDA kernel not updating correctly

Consider the following kernel, which counts the number of elements in x that are less than or equal to the corresponding element in y.

@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    shared = cuda.shared.array(1, dtype=DTYPE)
    if i < len(x):
        shared[0] += x[i] <= y[i]
    cuda.syncthreads()
    out[0] = shared[0]
    

However, the increments from each thread are not being saved properly in the shared array.

a = cuda.to_device(np.arange(5))  # [0 1 2 3 4]
b = cuda.to_device(np.arange(5))  # [0 1 2 3 4]
out = cuda.to_device(np.zeros(1)) # [0]
count_leq[1,len(a)](a, b, out)
print(out[0])                     # 1.0, but should be 5.0
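
For reference, the expected result can be checked on the host with plain NumPy (just a sanity check of what the kernel should return):

a_host = np.arange(5)
b_host = np.arange(5)
print(np.sum(a_host <= b_host))  # 5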

What am I doing wrong here? I'm confused because cuda.shared.array is shared by all threads in a given block, right? How do I accumulate the increments using the same 1-element array?

I also tried the following, which fails in the same way as the version above.

@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    if i < len(x):
        out[0] += x[i] <= y[i]

You need to perform an atomic add operation explicitly. A plain += on shared or global memory is a non-atomic read-modify-write, so concurrent threads race with each other and most of the increments are lost:

@cuda.jit
def count_leq(x, y, out):
    i = cuda.grid(1)
    if i < len(x):
        cuda.atomic.add(out, 0, x[i] <= y[i])
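
For example, reusing the setup from the question, this kernel should now produce the expected count (a sketch with the question's variables; out is assumed to start at zero since the kernel only accumulates into it):

a = cuda.to_device(np.arange(5))  # [0 1 2 3 4]
b = cuda.to_device(np.arange(5))  # [0 1 2 3 4]
out = cuda.to_device(np.zeros(1)) # [0], must start at zero
count_leq[1,len(a)](a, b, out)
print(out[0])                     # expected: 5.0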

Atomic adds are optimized on relatively recent devices, for example using a hardware warp reduction, but the operation tends not to scale when a large number of streaming multiprocessors perform atomic operations.

One solution to increase the performance of this kernel is to perform a block-wise reduction of many values, assuming the array is large enough. In practice, each thread can sum multiple items and perform a single atomic operation at the end. The code should look like this (untested):

# Must be launched with different parameters since
# each thread works on more array items.
# The number of blocks should be 16 times smaller.
@cuda.jit
def count_leq(x, y, out):
    tid = cuda.threadIdx.x
    bid = cuda.blockIdx.x
    bdim = cuda.blockDim.x
    i = (bid * bdim * 16) + tid

    s = 0

    # Fast general case (far from the end of the arrays)
    if i+16*bdim < len(x):
        # Thread-local reduction
        # This loop should be unrolled
        for j in range(16):
            idx = i + j * bdim
            s += x[idx] <= y[idx]

    # Slower corner case (close to end of the arrays: checks are needed)
    else:
        for j in range(16):
            idx = i + j * bdim
            if idx < len(x):
                s += x[idx] <= y[idx]

    cuda.atomic.add(out, 0, s)

Note that 16 is an arbitrary value. It is certainly faster to use a bigger value like 64 for huge arrays and a smaller value for relatively small arrays.
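
As an illustration, a matching launch configuration could be computed like this (a sketch based on the comments above; THREADS_PER_BLOCK and ITEMS_PER_THREAD are illustrative names that do not appear in the original answer):

import math

ITEMS_PER_THREAD = 16    # must match the 16 hard-coded in the kernel above
THREADS_PER_BLOCK = 128  # illustrative choice
n_blocks = math.ceil(len(a) / (THREADS_PER_BLOCK * ITEMS_PER_THREAD))
count_leq[n_blocks, THREADS_PER_BLOCK](a, b, out)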
