Using CUDA atomicInc to get unique indices

Question

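I have a CUDA kernel in which each thread holds a value and needs to add it to one or more lists in shared memory. For each of those lists, the thread therefore needs an index (unique within that list) at which to store the value.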

The real code is different, but there are lists like this:

typedef struct {
    unsigned int numItems;
    float items[MAX_NUM_ITEMS];
} List;
__shared__ List lists[NUM_LISTS];

The numItems values are all initially set to 0, followed by a __syncthreads().

To add its value to the lists, each thread does:

for(int list = 0; list < NUM_LISTS; ++list) {
    if(should_add_to_list(threadIdx, list)) {
        unsigned int index = atomicInc(&lists[list].numItems, 0xffffffff);
        assert(index < MAX_NUM_ITEMS); // always true
        lists[list].items[index] = my_value;
    }
}
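To make the intended behaviour checkable, here is a minimal self-contained sketch of the pattern above together with a host-side check that every index was handed out exactly once. The sizes (NUM_LISTS, NUM_THREADS, MAX_NUM_ITEMS), the bit-test version of should_add_to_list, and the names fillLists and countErrors are assumptions made up for illustration; they are not taken from the real kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_LISTS 4
#define NUM_THREADS 128
#define MAX_NUM_ITEMS NUM_THREADS

typedef struct {
    unsigned int numItems;
    float items[MAX_NUM_ITEMS];
} List;

// Hypothetical predicate: thread t joins list `list` when that bit of t is set.
__device__ bool should_add_to_list(unsigned int t, int list) {
    return ((t >> list) & 1u) != 0;
}

__global__ void fillLists(unsigned int* outCounts, float* outItems) {
    __shared__ List lists[NUM_LISTS];
    for (int list = 0; list < NUM_LISTS; ++list) {
        if (threadIdx.x == 0) lists[list].numItems = 0;
        lists[list].items[threadIdx.x] = -1.0f;  // marks an unused slot
    }
    __syncthreads();

    float my_value = (float)threadIdx.x;
    for (int list = 0; list < NUM_LISTS; ++list) {
        if (should_add_to_list(threadIdx.x, list)) {
            unsigned int index = atomicInc(&lists[list].numItems, 0xffffffff);
            if (index < MAX_NUM_ITEMS) lists[list].items[index] = my_value;
        }
    }
    __syncthreads();

    // Copy the shared lists out so the host can inspect them.
    for (int list = 0; list < NUM_LISTS; ++list) {
        if (threadIdx.x == 0) outCounts[list] = lists[list].numItems;
        outItems[list * MAX_NUM_ITEMS + threadIdx.x] = lists[list].items[threadIdx.x];
    }
}

// Returns the number of inconsistencies found, or -1 on a CUDA error.
int countErrors() {
    unsigned int* dCounts = nullptr;
    float* dItems = nullptr;
    unsigned int counts[NUM_LISTS];
    static float items[NUM_LISTS * MAX_NUM_ITEMS];
    if (cudaMalloc(&dCounts, sizeof(counts)) != cudaSuccess) return -1;
    if (cudaMalloc(&dItems, sizeof(items)) != cudaSuccess) { cudaFree(dCounts); return -1; }
    fillLists<<<1, NUM_THREADS>>>(dCounts, dItems);
    cudaError_t e1 = cudaMemcpy(counts, dCounts, sizeof(counts), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(items, dItems, sizeof(items), cudaMemcpyDeviceToHost);
    cudaFree(dCounts);
    cudaFree(dItems);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    int errors = 0;
    for (int list = 0; list < NUM_LISTS; ++list) {
        bool seen[NUM_THREADS] = {false};
        // Half of the threads have bit `list` set.
        if (counts[list] != NUM_THREADS / 2) ++errors;
        for (unsigned int i = 0; i < counts[list] && i < MAX_NUM_ITEMS; ++i) {
            int t = (int)items[list * MAX_NUM_ITEMS + i];
            // A duplicated index leaves a -1 hole; a repeated value is also an error.
            bool ok = t >= 0 && t < NUM_THREADS && ((t >> list) & 1) && !seen[t];
            if (ok) seen[t] = true; else ++errors;
        }
    }
    return errors;
}

int main() {
    printf("errors: %d\n", countErrors());
    return 0;
}
```

If atomicInc hands out unique indices, as the documentation suggests it should, this reports zero errors. A duplicated or skipped index shows up as a -1 slot inside the counted range.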

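This works most of the time, but when unrelated changes are made elsewhere in the kernel (such as removing assertions that always succeed), two threads sometimes get the same index for one list, or an index is skipped. The final value of numItems is always correct, however.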

However, with the following custom implementation, atomicInc_, it seems to work correctly:

__device__ static inline uint32_t atomicInc_(uint32_t* ptr) {
    uint32_t value;
    do {
        value = *ptr;
    } while(atomicCAS(ptr, value, value + 1) != value);
    return value;
}

Are the two atomicInc functions equivalent, and is it valid to use atomicInc this way to get unique indices?
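For reference, the CUDA Programming Guide describes atomicInc(address, val) as storing ((old >= val) ? 0 : (old + 1)) and returning old, so with val = 0xffffffff it should behave like a wrapping increment. The single-thread sketch below compares the return values and resulting counters of the built-in, the CAS loop above, and atomicAdd(ptr, 1). The names compareIncrements and incrementsAgree are made up for this illustration. It only checks the value semantics and says nothing about behaviour under contention.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// The CAS-based increment from the question, unchanged.
__device__ static inline uint32_t atomicInc_(uint32_t* ptr) {
    uint32_t value;
    do {
        value = *ptr;
    } while (atomicCAS(ptr, value, value + 1) != value);
    return value;
}

// Launched with a single thread: records return values and final counters.
__global__ void compareIncrements(uint32_t* out) {
    __shared__ uint32_t a, b, c;
    a = 5u; b = 5u; c = 5u;
    out[0] = atomicInc(&a, 0xffffffff);  // built-in
    out[1] = atomicInc_(&b);             // CAS loop
    out[2] = atomicAdd(&c, 1u);          // plain atomic add
    out[3] = a; out[4] = b; out[5] = c;  // all three counters are now 6

    a = 0xffffffffu; b = 0xffffffffu; c = 0xffffffffu;
    out[6] = atomicInc(&a, 0xffffffff);
    out[7] = atomicInc_(&b);
    out[8] = atomicAdd(&c, 1u);
    out[9] = a; out[10] = b; out[11] = c;  // all three wrap around to 0
}

// Returns true when all three variants produced the same, expected values.
bool incrementsAgree() {
    const uint32_t expected[12] = {5u, 5u, 5u, 6u, 6u, 6u,
                                   0xffffffffu, 0xffffffffu, 0xffffffffu, 0u, 0u, 0u};
    uint32_t host[12];
    uint32_t* dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    compareIncrements<<<1, 1>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < 12; ++i)
        if (host[i] != expected[i]) return false;
    return true;
}

int main() {
    printf("variants agree: %s\n", incrementsAgree() ? "yes" : "no");
    return 0;
}
```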

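According to the CUDA Programming Guide, atomic functions do not imply memory ordering constraints, and different threads can access the numItems of different lists at the same time: could this cause it to fail?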

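Edit:

The real kernel looks like this:

Basically there is a list of spot blocks, which contain spots. Each spot has XY coordinates (col, row). For each spot, the kernel needs to find the spots within a certain window (col/row difference) around it and put them into a list in shared memory.

The kernel is launched with a fixed number of warps. One CUDA block corresponds to a group of spot blocks (3 here). These are called the local spot blocks.

First it takes the spots from the block's 3 spot blocks and copies them into shared memory (localSpots[]). It uses one warp per spot block for this, so that the spots can be read coalesced. Each thread in the warp handles one spot of the local spot block. The spot block indices are hardcoded here (blocks[]).

Then it goes through the surrounding spot blocks: these are all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The indices of the surrounding spot blocks are also hardcoded here (sblocks[]).

In this example it uses only the first warp for this and iterates through sblocks[]. Each thread in the warp handles one spot of the surrounding spot block. It also iterates through the list of all local spots. If the thread's spot is close enough to the local spot, it inserts it into the local spot's list, using atomicInc to get an index.

When executed, the printf shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.

The real code is more complex and optimized, but this code already shows the problem. Here it also runs only one CUDA block.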

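#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// ASSERT, WARP_SIZE, numSpotBlocks, UnpackCoordinates and
// XGPU_SpotProcessingBlockSpotDataBuffers are not defined in this excerpt.
#define MAX_NUM_SPOTS_IN_WINDOW 80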

__global__ void Kernel(
    const uint16_t* blockNumSpotsBuffer,
    XGPU_SpotProcessingBlockSpotDataBuffers blockSpotsBuffers,
    size_t blockSpotsBuffersElementPitch,
    int2 unused1,
    int2 unused2,
    int unused3 ) {
    typedef unsigned int uint;

    if(blockIdx.x!=30 || blockIdx.y!=1) return;

    int window = 5;

    ASSERT(blockDim.x % WARP_SIZE == 0);
    ASSERT(blockDim.y == 1);

    uint numWarps = blockDim.x / WARP_SIZE;
    uint idxWarp = threadIdx.x / WARP_SIZE;
    int idxThreadInWarp = threadIdx.x % WARP_SIZE;

    struct Spot {
        int16_t row;
        int16_t col;
        volatile unsigned int numSamples;
        float signalSamples[MAX_NUM_SPOTS_IN_WINDOW];
    };

    __shared__ uint numLocalSpots;
    __shared__ Spot localSpots[3 * 32];

    numLocalSpots = 0;

    __syncthreads();

    ASSERT(numWarps >= 3);
    int blocks[3] = {174, 222, 270};
    if(idxWarp < 3) {
        uint spotBlockIdx = blocks[idxWarp];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);

        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        ASSERT(numSpots < WARP_SIZE);

        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;

        uint outOffset;
        if(idxThreadInWarp == 0) outOffset = atomicAdd(&numLocalSpots, numSpots);
        outOffset = __shfl_sync(0xffffffff, outOffset, 0, 32);

        if(idxThreadInWarp < numSpots) {
            Spot* outSpot = &localSpots[outOffset + idxThreadInWarp];
            outSpot->numSamples = 0;

            uint32_t coord = blockSpotsBuffers.coord[inOffset];
            UnpackCoordinates(coord, &outSpot->row, &outSpot->col);
        }
    }



    __syncthreads();


    int sblocks[] = { 29,30,31,77,78,79,125,126,127,173,174,175,221,222,223,269,270,271,317,318,319,365,366,367,413,414,415 };
    if(idxWarp == 0) for(int block = 0; block < sizeof(sblocks)/sizeof(int); ++block) {
        uint spotBlockIdx = sblocks[block];
        ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);

        uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
        uint idxThreadInWarp = threadIdx.x % WARP_SIZE;
        if(idxThreadInWarp >= numSpots) continue;

        size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;

        uint32_t coord = blockSpotsBuffers.coord[inOffset];
        if(coord == 0) return; // invalid surrounding spot

        int16_t row, col;
        UnpackCoordinates(coord, &row, &col);

        for(int idxLocalSpot = 0; idxLocalSpot < numLocalSpots; ++idxLocalSpot) {
            Spot* localSpot = &localSpots[idxLocalSpot];

            if(localSpot->row == 0 && localSpot->col == 0) continue;
            if((abs(localSpot->row - row) >= window) && (abs(localSpot->col - col) >= window)) continue;

            int index = atomicInc_block((unsigned int*)&localSpot->numSamples, 0xffffffff);
            if(localSpot->row == 37 && localSpot->col == 977) printf("%02d  ", index); // <-- sometimes indices are skipped or duplicated

            if(index >= MAX_NUM_SPOTS_IN_WINDOW) continue; // index out of bounds, discard value for median calculation
            localSpot->signalSamples[index] = blockSpotsBuffers.signal[inOffset];
        }
    } }

The output looks like this:

00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  23                                                                                                                   
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24                                                                                                                 
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  02  03  03  04  05  06  07  08  09  10  11  12  06  13  14  15  16  17  18  19  20  21        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  23        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24        
00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24

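Each line is the output of one execution (the kernel is run multiple times). It is expected that the indices appear in a different order each time. But in the third-to-last line, for example, index 23 is repeated.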

Using atomicCAS seems to fix it. Using __syncwarp() between iterations of the outer for loop also seems to fix it. But it is not clear why, or whether this always fixes it.
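To show what the __syncwarp() variant looks like in isolation, here is a reduced sketch with one warp, an outer loop of rounds, and an inner loop over shared counters. Everything in it is a stand-in invented for illustration (NUM_ROUNDS for the sblocks[] loop, NUM_COUNTERS for the local spots, the two bit-test conditions, and the names warpRounds and countWarpErrors). One detail matters: the early `continue` and `return` of the real kernel are replaced by a predicate, so that all 32 lanes reach the __syncwarp() in every iteration. This only sketches the workaround; whether it is a guaranteed fix is exactly the open question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define NUM_ROUNDS 8      // hypothetical stand-in for the sblocks[] loop
#define NUM_COUNTERS 4    // hypothetical stand-in for the local spots
#define MAX_SLOTS (WARP_SIZE * NUM_ROUNDS)

__global__ void warpRounds(unsigned int* outCounts, int* outSlots) {
    __shared__ unsigned int counters[NUM_COUNTERS];
    __shared__ int slots[NUM_COUNTERS * MAX_SLOTS];
    unsigned int lane = threadIdx.x % WARP_SIZE;
    if (lane < NUM_COUNTERS) counters[lane] = 0;
    for (int i = lane; i < NUM_COUNTERS * MAX_SLOTS; i += WARP_SIZE) slots[i] = -1;
    __syncwarp();

    for (int round = 0; round < NUM_ROUNDS; ++round) {
        // A predicate replaces `continue`/`return`, so every lane still
        // reaches the __syncwarp() at the end of the iteration.
        bool active = ((lane + round) % 3u) != 0;  // hypothetical condition
        if (active) {
            for (int c = 0; c < NUM_COUNTERS; ++c) {
                if (((lane >> c) & 1u) == 0) continue;  // hypothetical "not close enough"
                unsigned int index = atomicInc(&counters[c], 0xffffffff);
                if (index < MAX_SLOTS) slots[c * MAX_SLOTS + index] = round * WARP_SIZE + lane;
            }
        }
        __syncwarp();  // the workaround: reconverge between outer iterations
    }

    if (lane < NUM_COUNTERS) outCounts[lane] = counters[lane];
    for (int i = lane; i < NUM_COUNTERS * MAX_SLOTS; i += WARP_SIZE) outSlots[i] = slots[i];
}

// Returns the number of duplicated or skipped indices, or -1 on a CUDA error.
int countWarpErrors() {
    unsigned int* dCounts = nullptr;
    int* dSlots = nullptr;
    unsigned int counts[NUM_COUNTERS];
    static int slots[NUM_COUNTERS * MAX_SLOTS];
    if (cudaMalloc(&dCounts, sizeof(counts)) != cudaSuccess) return -1;
    if (cudaMalloc(&dSlots, sizeof(slots)) != cudaSuccess) { cudaFree(dCounts); return -1; }
    warpRounds<<<1, WARP_SIZE>>>(dCounts, dSlots);
    cudaError_t e1 = cudaMemcpy(counts, dCounts, sizeof(counts), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(slots, dSlots, sizeof(slots), cudaMemcpyDeviceToHost);
    cudaFree(dCounts);
    cudaFree(dSlots);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    int errors = 0;
    for (int c = 0; c < NUM_COUNTERS; ++c) {
        bool seen[MAX_SLOTS] = {false};
        for (unsigned int i = 0; i < MAX_SLOTS; ++i) {
            int tag = slots[c * MAX_SLOTS + i];
            if (i < counts[c]) {
                // Every handed-out index must hold one distinct (round, lane) tag.
                bool ok = tag >= 0 && tag < MAX_SLOTS && !seen[tag];
                if (ok) seen[tag] = true; else ++errors;
            } else if (tag != -1) {
                ++errors;  // something was written past the final count
            }
        }
    }
    return errors;
}

int main() {
    printf("errors: %d\n", countWarpErrors());
    return 0;
}
```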

Edit 2: This is a complete program (main.cu) that shows the problem:

https://pastebin.com/cDqYmjGb

CMakeLists.txt:

https://pastebin.com/iB9mbUJw

It must be compiled with -DCMAKE_BUILD_TYPE=Release.

It produces this output:

00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)

For example, the lines with 03 show that two threads (1 and 2) got the same result (3) after calling atomicInc_block on the same counter (at 0x00000221E40003E0).

Answer 1

According to my tests, this problem is fixed in CUDA 11.4.1 with driver 470.52.02. It may also be fixed in some earlier releases of CUDA 11.4 and 11.3, but the problem exists in CUDA 11.2.
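If it helps to see which versions a given machine is running, a small host-side sketch like the following prints them. cudaRuntimeGetVersion and cudaDriverGetVersion are real runtime API calls; the helper isAtLeast and the 11.4 threshold are just illustrative choices based on the versions mentioned above. Two caveats: cudaDriverGetVersion reports the newest CUDA version the installed driver supports, not a driver release number such as 470.52.02, and as far as I know the encoded value does not distinguish update releases such as 11.4.0 and 11.4.1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CUDA encodes versions as 1000 * major + 10 * minor (for example 11.4 -> 11040).
bool isAtLeast(int version, int major, int minor) {
    return version >= 1000 * major + 10 * minor;
}

int main() {
    int runtimeVersion = 0;
    int driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    cudaDriverGetVersion(&driverVersion);
    printf("runtime %d, driver supports up to %d\n", runtimeVersion, driverVersion);
    if (!isAtLeast(runtimeVersion, 11, 4)) {
        // Illustrative threshold only; see the caveats about update releases.
        printf("warning: runtime is older than CUDA 11.4\n");
    }
    return 0;
}
```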

Solution 1 (accepted, 2021-08-14 18:27:07)
