
Using CUDA atomicInc to get unique indices

I have a CUDA kernel where, basically, each thread holds a value and needs to add that value to one or more lists in shared memory. So for each of those lists, it needs to get an index (unique within that list) at which to put the value.

The real code is different, but there are lists like:

typedef struct {
    unsigned int numItems;
    float items[MAX_NUM_ITEMS];
} List;
__shared__ List lists[NUM_LISTS];

The numItems values are initially all set to 0, and then a __syncthreads() is done.
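
For completeness, the initialization looks roughly like this (a minimal sketch, assuming the counters are zeroed cooperatively by the threads of the block; the exact code differs):

// Sketch, not the real code: zero all list counters before any thread inserts
for(int list = threadIdx.x; list < NUM_LISTS; list += blockDim.x) {
    lists[list].numItems = 0;
}
__syncthreads(); // make the zeroed counters visible to every thread in the block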

To add its value to the lists, each thread does:

for(int list = 0; list < NUM_LISTS; ++list) {
    if(should_add_to_list(threadIdx, list)) {
        unsigned int index = atomicInc(&lists[list].numItems, 0xffffffff);
        assert(index < MAX_NUM_ITEMS); // always true
        lists[list].items[index] = my_value;
    }
}

This works most of the time, but it seems that when I make some unrelated changes in other parts of the kernel (such as not checking asserts that always succeed), sometimes two threads get the same index for one list, or indices are skipped. The final value of numItems always becomes correct, however.

However, when using the following custom implementation atomicInc_ instead, it seems to work correctly:

__device__ static inline uint32_t atomicInc_(uint32_t* ptr) {
    uint32_t value;
    do {
        value = *ptr;
    } while(atomicCAS(ptr, value, value + 1) != value); // retry until the CAS succeeds, i.e. nobody changed *ptr in between
    return value;
}

Are the two atomicInc functions equivalent, and is it valid to use atomicInc that way to get unique indices?

According to the CUDA Programming Guide, the atomic functions do not imply memory ordering constraints, and different threads can access the numItems counters of different lists at the same time: could this cause it to fail?
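
For reference, this is my reading of the documented semantics of atomicInc, written out as plain non-atomic code (the real function performs these steps as one atomic operation; atomicInc_reference is just a name I made up for illustration):

// Documented behavior of unsigned int atomicInc(unsigned int* address, unsigned int val),
// spelled out non-atomically:
__device__ unsigned int atomicInc_reference(unsigned int* address, unsigned int val) {
    unsigned int old = *address;                // read the old value
    *address = (old >= val) ? 0 : (old + 1);    // increment, wrapping to 0 once old reaches val
    return old;                                 // return the value before the increment
}
// With val = 0xffffffff the counter only wraps at 0xffffffff, so for small lists
// it should hand out 0, 1, 2, ... exactly like atomicAdd(address, 1).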

Edit:

The real kernel looks like this:

Basically there is a list of spot blocks, which contain spots. Each spot has XY coordinates (col, row). For each spot, the kernel needs to find the spots that lie within a certain window (col/row difference) around it and put them into a list in shared memory.

The kernel is called with a fixed number of warps. A CUDA block corresponds to a group of spot blocks (here 3); these are called the local spot blocks.

First it takes the spots from the block's 3 spot blocks and copies them into shared memory (localSpots[]). It uses one warp per spot block so that the spots can be read coalesced; each thread in the warp handles one spot of the local spot block. The spot block indices are hardcoded here (blocks[]).

Then it goes through the surrounding spot blocks: these are all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The surrounding spot block indices are also hardcoded here (sblocks[]).

In this example only the first warp does this, traversing sblocks[] iteratively. Each thread in the warp handles one spot of the current surrounding spot block and iterates through the list of all local spots. If the thread's spot is close enough to a local spot, it inserts the value into that local spot's list, using atomicInc to get an index.

When executed, the printf shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.

The real code is more complex and optimized, but this code already shows the problem. It also runs only one CUDA block here.

#include <assert.h>
#include <stdio.h>

#define MAX_NUM_SPOTS_IN_WINDOW 80

__global__ void Kernel(
    const uint16_t* blockNumSpotsBuffer,
    XGPU_SpotProcessingBlockSpotDataBuffers blockSpotsBuffers,
    size_t blockSpotsBuffersElementPitch,
    int2 unused1,
    int2 unused2,
    int unused3 ) {
    typedef unsigned int uint;

    if(blockIdx.x!=30 || blockIdx.y!=1) return;

    int window = 5;

    ASSERT(blockDim.x % WARP_SIZE == 0);
    ASSERT(blockDim.y == 1);

    uint numWarps = blockDim.x / WARP_SIZE;
    uint idxWarp = threadIdx.x / WARP_SIZE;
    int idxThreadInWarp = threadIdx.x % WARP_SIZE;

    struct Spot {
        int16_t row;
        int16_t col;
        volatile unsigned int numSamples;
        float signalSamples[MAX_NUM_SPOTS_IN_WINDOW];
    };

    __shared__ uint numLocalSpots;
    __shared__ Spot localSpots[3 * 32];

    numLocalSpots = 0;

    __syncthreads();

    ASSERT(numWarps >= 3);
    int blocks[3] = {174, 222, 270};
    if(idxWarp < 3) {
        uint spotBlockIdx = blocks[idxWarp];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);

        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        ASSERT(numSpots < WARP_SIZE);

        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;

        uint outOffset;
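        // Lane 0 reserves a contiguous range in localSpots[] for this spot block, then broadcasts the base offset to the whole warp.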
        if(idxThreadInWarp == 0) outOffset = atomicAdd(&numLocalSpots, numSpots);
        outOffset = __shfl_sync(0xffffffff, outOffset, 0, 32);

        if(idxThreadInWarp < numSpots) {
            Spot* outSpot = &localSpots[outOffset + idxThreadInWarp];
            outSpot->numSamples = 0;

            uint32_t coord = blockSpotsBuffers.coord[inOffset];
            UnpackCoordinates(coord, &outSpot->row, &outSpot->col);
        }
    }



    __syncthreads();


    int sblocks[] = { 29,30,31,77,78,79,125,126,127,173,174,175,221,222,223,269,270,271,317,318,319,365,366,367,413,414,415 };
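    // Only the first warp scans the surrounding spot blocks; each lane handles one candidate spot per surrounding block.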
    if(idxWarp == 0) for(int block = 0; block < sizeof(sblocks)/sizeof(int); ++block) {
        uint spotBlockIdx = sblocks[block];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);

        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        uint idxThreadInWarp = threadIdx.x % WARP_SIZE;
        if(idxThreadInWarp >= numSpots) continue;

        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;

        uint32_t coord = blockSpotsBuffers.coord[inOffset];
        if(coord == 0) return; // invalid surrounding spot

        int16_t row, col;
        UnpackCoordinates(coord, &row, &col);

        for(int idxLocalSpot = 0; idxLocalSpot < numLocalSpots; ++idxLocalSpot) {
            Spot* localSpot = &localSpots[idxLocalSpot];

            if(localSpot->row == 0 && localSpot->col == 0) continue;
            if((abs(localSpot->row - row) >= window) && (abs(localSpot->col - col) >= window)) continue;

            int index = atomicInc_block((unsigned int*)&localSpot->numSamples, 0xffffffff);
            if(localSpot->row == 37 && localSpot->col == 977) printf("%02d  ", index); // <-- sometimes indices are skipped or duplicated

            if(index >= MAX_NUM_SPOTS_IN_WINDOW) continue; // index out of bounds, discard value for median calculation
            localSpot->signalSamples[index] = blockSpotsBuffers.signal[inOffset];
        }
    } }

The output looks like this:

00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  23                                                                                                                   
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24                                                                                                                 
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  02  03  03  04  05  06  07  08  09  10  11  12  06  13  14  15  16  17  18  19  20  21        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  23        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24    

Each line is the output of one execution (the kernel is run multiple times). It is expected that the indices appear in different orders, but on the third-last line, for example, index 23 is repeated.

Using atomicCAS seems to fix it. Using __syncwarp() between iterations of the outer for-loop also seems to fix it. But it is not clear why, or whether that always fixes it.
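
For illustration, here is the pattern stripped down to its core (my own minimal sketch; I have not verified that this reduced kernel reproduces the bug by itself, and the commented-out __syncwarp() marks where the workaround goes in the real kernel):

#include <cstdio>

// Compile for sm_60 or newer (e.g. nvcc -arch=sm_70), since atomicInc_block
// requires compute capability 6.0+.
__global__ void MinimalPattern() {
    __shared__ unsigned int counter;
    if(threadIdx.x == 0) counter = 0;
    __syncthreads();

    if(threadIdx.x < 32) {                       // only the first warp participates, as in the real kernel
        for(int i = 0; i < 4; ++i) {             // outer loop, like the loop over surrounding spot blocks
            unsigned int index = atomicInc_block(&counter, 0xffffffff);
            printf("iteration %d: thread %d got index %u\n", i, (int)threadIdx.x, index);
            // __syncwarp();                     // placing the workaround here made the duplicates disappear
        }
    }
}

int main() {
    MinimalPattern<<<1, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}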


Edit 2: This is a full program (main.cu) that shows the problem:

https://pastebin.com/cDqYmjGb

The CMakeLists.txt:

https://pastebin.com/iB9mbUJw

It must be compiled with -DCMAKE_BUILD_TYPE=Release.

It produces this output:

00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)

For example, the lines with 03 show that two threads (1 and 2) get the same result (3) after calling atomicInc_block on the same counter (at 0x00000221E40003E0).

According to my testing, this problem is fixed in CUDA 11.4.1, currently available here, and driver 470.52.02. It may also be fixed in some earlier versions of CUDA 11.4 and 11.3, but the problem is present in CUDA 11.2.
