
Questions about shared memory in the tree reduction

On page 22 of NVIDIA's Optimizing Parallel Reduction in CUDA slides,

__device__ void warpReduce(volatile int* sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// later…
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);

I am not sure how CUDA processes the operation in the last line of the function warpReduce(...): sdata[tid] += sdata[tid + 1];. If a warp has thread IDs tid from 0 to 31, do threads 0 and 1 conflict at this moment? What I mean is, how will CUDA choose the order of sdata[0] += sdata[0 + 1]; and sdata[1] += sdata[1 + 1];? If it runs sdata[0] += sdata[0 + 1]; first, then the result is correct. Or does CUDA do something to prevent this confusion/conflict?

Those slides are horribly outdated. Using shared memory for reductions within a single warp hasn't been state-of-the-art since warp shuffle instructions were introduced with compute capability 3.0 (IIRC). And even those are obsolete at this point.
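For illustration, here is a minimal sketch of such a shuffle-based warp sum (the *_sync variants shown require CUDA 9 or later; the name warpReduceShuffle and the assumption that all 32 lanes are active are mine):

__device__ int warpReduceShuffle(int val) {
    // Each step folds in the value held `offset` lanes above the current
    // lane; no shared memory is involved, data moves register-to-register.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 now holds the sum over all 32 lanes
}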

These days you would use warp-reduce intrinsics such as __reduce_add_sync.
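A sketch of what that looks like (requires compute capability 8.0 or later; the wrapper name warpReduceAdd is mine):

__device__ int warpReduceAdd(int val) {
    // Single hardware-accelerated warp-wide sum; every lane named in the
    // full mask 0xffffffff receives the complete result.
    return __reduce_add_sync(0xffffffff, val);
}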

As for why the pattern above used to work: prior to the Volta architecture, the threads in a warp executed in lockstep. At each step of the reduction, all 32 threads would first load their operands and only then store the results, so, for example, sdata[1] was always read by thread 0 before thread 1 overwrote it. Therefore synchronization was unnecessary. This has changed. Note this from Nvidia's Volta Tuning Guide:

Applications that assume reads and writes are implicitly visible to other threads in the same warp need to insert the new __syncwarp() warp-wide barrier synchronization instruction between steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid.
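Concretely, a Volta-safe rewrite of the warpReduce above stages each step through a register and inserts __syncwarp() between the reads and the writes (a sketch following the tuning guide's advice, not code from the slides):

__device__ void warpReduce(volatile int* sdata, int tid) {
    int v = sdata[tid];
    // At each step, every lane reads its operand, all lanes sync, every
    // lane writes, and all lanes sync again, so a write can never race
    // with a neighboring lane's read.
    v += sdata[tid + 32]; __syncwarp();
    sdata[tid] = v;       __syncwarp();
    v += sdata[tid + 16]; __syncwarp();
    sdata[tid] = v;       __syncwarp();
    v += sdata[tid + 8];  __syncwarp();
    sdata[tid] = v;       __syncwarp();
    v += sdata[tid + 4];  __syncwarp();
    sdata[tid] = v;       __syncwarp();
    v += sdata[tid + 2];  __syncwarp();
    sdata[tid] = v;       __syncwarp();
    v += sdata[tid + 1];  __syncwarp();
    sdata[tid] = v;
}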

You may find newer sample code in the GitHub repository: https://github.com/NVIDIA/cuda-samples

There may also be a newer version of those slides out there, but I don't know where.
