
About the number of registers allocated per SM in CUDA

First question. The CUDA C Programming Guide says the following:

The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory and 16 KB of L1 cache or as 16 KB of shared memory and 48 KB of L1 cache

But deviceQuery shows "Total number of registers available per block: 32768". I am using a GTX 580 (compute capability 2.0). The guide says the default cache size is 16 KB, but 32768 registers means 32768 * 4 bytes = 131072 bytes = 128 KB. I honestly don't know which is correct.
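For reference, here is a minimal sketch of how these numbers can be queried directly with cudaGetDeviceProperties (the register count and the shared-memory/L1 pool come from separate fields; the printf wording is mine):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                                        // device 0
    printf("registers per block : %d\n",  prop.regsPerBlock);                 // 32768 on a GTX 580
    printf("shared mem per block: %zu\n", prop.sharedMemPerBlock);            // 49152 bytes (48 KB) on CC 2.0
    printf("max threads per SM  : %d\n",  prop.maxThreadsPerMultiProcessor);  // 1536 on CC 2.0
    return 0;
}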

Second question. I set up the launch like this:

dim3    grid(32, 32);            //blocks in a grid
dim3    block(16, 16);           //threads in a block
kernel<<<grid,block>>>(...);

Then the number of threads per block is 256, so we need 256*N registers per block, where N is the number of registers each thread needs, and (256*N)*blocks is the number of registers per SM (a count, not bytes). So if the default size is 16 KB and threads/SM is at its maximum (1536), N cannot exceed 2, because of "Maximum number of threads per multiprocessor: 1536": 16 KB / 4 bytes = 4096 registers, and 4096/1536 = 2.666...

In the case of the larger 48 KB cache, N cannot exceed 8: 48 KB / 4 bytes = 12288 registers, and 12288/1536 = 8.

Is that true? I'm honestly quite confused.
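(For comparison, here is the same arithmetic written out against the register file that deviceQuery reports instead of against the cache size; treating all 32768 registers of a CC 2.0 SM as available to the resident threads is my own assumption.)

// A quick sketch of the two readings (values hard-coded for CC 2.0 / GTX 580)
const int threads_per_sm    = 1536;
const int regs_if_16KB_pool = (16*1024)/4;  // 4096  -> 4096/1536  = 2.66 registers per thread
const int regs_if_48KB_pool = (48*1024)/4;  // 12288 -> 12288/1536 = 8    registers per thread
const int regs_in_reg_file  = 32768;        // 32768 -> 32768/1536 = 21.3 registers per thread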


Here is my nearly complete code. I expected the kernel to perform best with a 16x16 block dimension, but 8x8 is faster than (or about as fast as) 16x16, and I don't know why.

The number of registers per thread is 16, and the shared memory usage is 80+16 bytes.

I asked a similar question before but could not get an exact answer: The result of an experiment different from CUDA Occupancy Calculator

#define WIDTH 512
#define HEIGHT 512
#define TILE_WIDTH 8
#define TILE_HEIGHT 8
#define CHANNELS 3
#define DEVICENUM 1 
#define HEIGHTs HEIGHT/DEVICENUM

__global__ void PRINT_POLYGON( unsigned char *IMAGEin, int *MEMin, char a, char b, char c){
        int Col = blockIdx.y*blockDim.y+ threadIdx.y;           //Col is y coordinate
        int Row = blockIdx.x*blockDim.x+ threadIdx.x;           //Row is x coordinate
        int tid_in_block = threadIdx.x + threadIdx.y*blockDim.x;
        int bid_in_grid = blockIdx.x + blockIdx.y*gridDim.x;
        int threads_per_block = blockDim.x * blockDim.y;
        int tid_in_grid = tid_in_block + threads_per_block * bid_in_grid;

        float result_a, result_b;
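        // M caches the five attributes (x, y, R, G, B) of the three vertices
        // a, b, c in M[0..4], M[5..9], and M[10..14] respectively.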
        __shared__ int M[15];
        for(int k = 0; k < 5; k++){
                M[k] = MEMin[a*5+k];
                M[k+5] = MEMin[b*5+k];
                M[k+10] = MEMin[c*5+k];
        }

        int result_a_up = (M[11]-M[1])*(Row-M[0]) - (M[10]-M[0])*(Col-M[1]);
        int result_b_up = (M[6] -M[1])*(M[0]-Row) - (M[5] -M[0])*(M[1]-Col);

        int result_down = (M[11]-M[1])*(M[5]-M[0]) - (M[6]-M[1])*(M[10]-M[0]);

        result_a = (float)result_a_up / (float)result_down;
        result_b = (float)result_b_up / (float)result_down;

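        // (Row, Col) lies inside triangle abc iff both weights and their sum are in [0, 1];
        // the branch below interpolates the three vertex colors into the pixel.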
        if((0 <= result_a && result_a <=1) && ((0 <= result_b && result_b <= 1)) && ((0 <= (result_a+result_b) && (result_a+result_b) <= 1))){
                IMAGEin[tid_in_grid*CHANNELS] += M[2] + (M[7]-M[2])*result_a + (M[12]-M[2])*result_b;      //Red Channel
                IMAGEin[tid_in_grid*CHANNELS+1] += M[3] + (M[8]-M[3])*result_a + (M[13]-M[3])*result_b;    //Green Channel
                IMAGEin[tid_in_grid*CHANNELS+2] += M[4] + (M[9]-M[4])*result_a + (M[14]-M[4])*result_b;    //Blue Channel
        }
}

struct DataStruct {
    int                 deviceID;
    unsigned char       IMAGE_SEG[WIDTH*HEIGHTs*CHANNELS];
};

void* routine( void *pvoidData ) { 
        DataStruct  *data = (DataStruct*)pvoidData;
        unsigned char *dev_IMAGE;
        int *dev_MEM;
        unsigned char *IMAGE_SEG = data->IMAGE_SEG;

        HANDLE_ERROR(cudaSetDevice(5));

        //initialize array
        memset(IMAGE_SEG, 0, WIDTH*HEIGHTs*CHANNELS);
        cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
        printf("Device %d Starting..\n", data->deviceID);

        //Evaluate Time
        cudaEvent_t start, stop;
        cudaEventCreate( &start );
        cudaEventCreate( &stop );

        cudaEventRecord(start, 0); 

        HANDLE_ERROR( cudaMalloc( (void **)&dev_MEM, sizeof(int)*35) );
        HANDLE_ERROR( cudaMalloc( (void **)&dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS) );

        cudaMemcpy(dev_MEM, MEM, sizeof(int)*35, cudaMemcpyHostToDevice);
        cudaMemset(dev_IMAGE, 0, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS);

        dim3    grid(WIDTH/TILE_WIDTH, HEIGHTs/TILE_HEIGHT);            //blocks in a grid
        dim3    block(TILE_WIDTH, TILE_HEIGHT);                         //threads in a block

        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 1, 2);
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 2, 3);
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 3, 4);
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 0, 4, 5);
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 3, 2, 4);
        PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, 2, 6, 4);

        HANDLE_ERROR( cudaMemcpy( IMAGE_SEG, dev_IMAGE, sizeof(unsigned char)*WIDTH*HEIGHTs*CHANNELS, cudaMemcpyDeviceToHost ) );
        HANDLE_ERROR( cudaFree( dev_MEM ) );
        HANDLE_ERROR( cudaFree( dev_IMAGE ) );

        cudaEventRecord(stop, 0); 
        cudaEventSynchronize(stop);

        cudaEventElapsedTime( &elapsed_time_ms[data->deviceID], start, stop );
        cudaEventDestroy(start);
        cudaEventDestroy(stop);


        elapsed_time_ms[DEVICENUM] += elapsed_time_ms[data->deviceID];
        printf("Device %d Complete!\n", data->deviceID);

        return 0;
}

The blockDim 8x8 case is faster than 16x16 due to the increase in address divergence in your memory accesses when you increase the block size.

Metrics collected on a GTX 480 with 15 SMs.

metric                         8x8         16x16
duration                        161µs       114µs
issued_ipc                     1.24        1.31
executed_ipc                    .88         .59
serialization                 54.61%      28.74%

The number of instruction replays clues us in that we likely have bad memory access patterns.

achieved occupancy            88.32%      30.76%
0 warp schedulers issues       8.81%       7.98%
1 warp schedulers issues       2.36%      29.54%
2 warp schedulers issues      88.83%      52.44%

16x16 appears to keep the warp schedulers busy; however, it is keeping them busy re-issuing instructions.

l1 global load trans          524,407     332,007
l1 global store trans         401,224     209,139
l1 global load trans/request    3.56        2.25
l1 global store trans/request  16.33        8.51

The first priority is to reduce transactions per request. The Nsight VSE source view can display memory statistics per instruction. The primary issue in your kernel is the interleaved U8 loads and stores for IMAGEin[] += value. At 16x16 this results in 16.3 transactions per request but only 8.3 for the 8x8 configuration.

Changing IMAGEin[(i*HEIGHTs+j)*CHANNELS] += ... so that the accesses are consecutive increases the performance of the 16x16 case by 3x. I imagine increasing the number of channels to 4 and handling the packing in the kernel will improve cache performance and memory throughput.
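A minimal sketch of what such a packed 4-channel variant could look like (the uchar4 layout and the PRINT_POLYGON_PACKED name are illustrative, not the exact change measured above):

// Hypothetical packed variant: IMAGEout is WIDTH*HEIGHTs uchar4 elements (RGBA,
// alpha unused), so each pixel is one aligned 4-byte load plus one 4-byte store
// instead of three interleaved byte-wide read-modify-writes.
__global__ void PRINT_POLYGON_PACKED( uchar4 *IMAGEout, int *MEMin, char a, char b, char c){
        int Col = blockIdx.y*blockDim.y + threadIdx.y;
        int Row = blockIdx.x*blockDim.x + threadIdx.x;
        int tid_in_block = threadIdx.x + threadIdx.y*blockDim.x;
        int bid_in_grid = blockIdx.x + blockIdx.y*gridDim.x;
        int tid_in_grid = tid_in_block + blockDim.x*blockDim.y*bid_in_grid;

        __shared__ int M[15];
        for(int k = 0; k < 5; k++){
                M[k] = MEMin[a*5+k];
                M[k+5] = MEMin[b*5+k];
                M[k+10] = MEMin[c*5+k];
        }

        int result_a_up = (M[11]-M[1])*(Row-M[0]) - (M[10]-M[0])*(Col-M[1]);
        int result_b_up = (M[6] -M[1])*(M[0]-Row) - (M[5] -M[0])*(M[1]-Col);
        int result_down = (M[11]-M[1])*(M[5]-M[0]) - (M[6]-M[1])*(M[10]-M[0]);

        float result_a = (float)result_a_up / (float)result_down;
        float result_b = (float)result_b_up / (float)result_down;

        if((0 <= result_a && result_a <= 1) && (0 <= result_b && result_b <= 1) && (0 <= result_a+result_b && result_a+result_b <= 1)){
                uchar4 px = IMAGEout[tid_in_grid];                              // one 4-byte load
                px.x += M[2] + (M[7]-M[2])*result_a + (M[12]-M[2])*result_b;    // Red
                px.y += M[3] + (M[8]-M[3])*result_a + (M[13]-M[3])*result_b;    // Green
                px.z += M[4] + (M[9]-M[4])*result_a + (M[14]-M[4])*result_b;    // Blue
                IMAGEout[tid_in_grid] = px;                                     // one 4-byte store
        }
}

The host side would then allocate dev_IMAGE as WIDTH*HEIGHTs uchar4 elements and unpack (or simply ignore) the fourth byte after the copy back.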

If you fix the number of memory transactions per request, you will then likely have to look at execution dependencies and try to increase your ILP.

It is faster with a block size of 8x8 because 64 threads is a smaller multiple of 32. As visible in the picture below, 32 CUDA cores are bound together, with two different warp schedulers that actually schedule the same thing, so the same instruction is executed on these 32 cores in each execution cycle.

To clarify this further: in the first case (8x8), each block is made of two warps (64 threads), so it is finished within only two execution cycles; however, when you use (16x16) as your block size, each block takes 8 warps (256 threads), therefore taking 4 times more execution cycles and finishing more slowly.

However, filling an SM with more warps is better in some cases: when memory access is heavy and each warp is likely to go into a memory stall (i.e. while getting its operands from memory), it will be replaced by another warp until the memory operation completes, resulting in higher occupancy of the SM.

You should of course also factor the number of blocks per SM and the total number of SMs into your calculations; for example, assigning more than 8 blocks to a single SM might reduce its occupancy. In your case, though, you are probably not facing these issues, because 256 is generally a better number than 64: it balances your blocks among the SMs, whereas using 64 threads per block results in more blocks being executed on the same SM.
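If your toolkit is new enough (CUDA 6.5 or later, so newer than what this question was originally written against), you can also let the runtime do this bookkeeping; a minimal sketch, assuming the PRINT_POLYGON kernel from the question is in scope:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    int blocks8 = 0, blocks16 = 0;
    // Maximum resident blocks per SM for the two block sizes (no dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks8,  PRINT_POLYGON, 8*8,   0);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks16, PRINT_POLYGON, 16*16, 0);
    printf("8x8  : %d blocks/SM = %d threads/SM\n", blocks8,  blocks8*64);
    printf("16x16: %d blocks/SM = %d threads/SM\n", blocks16, blocks16*256);
    return 0;
}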

EDIT: This answer is based on my speculations; for a more scientific approach, see Greg Smith's answer.

The register pool is different from shared memory/cache, right down to the bottom of the architecture!

Registers are made of flip-flops, while the L1 cache is probably SRAM.

Just to get an idea, look at the picture below, which represents the Fermi architecture, and then update your question to further specify the problem you are facing.

[Figure: Fermi architecture]

As a note, you can see how many registers and how much shared memory (smem) your functions use by passing the option --ptxas-options=-v to nvcc.
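If you prefer to check at run time rather than at compile time, cudaFuncGetAttributes reports the same per-kernel numbers; a small sketch, assuming the PRINT_POLYGON kernel is in scope:

#include <cstdio>
#include <cuda_runtime.h>

void print_kernel_resources(){
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, PRINT_POLYGON);
    printf("registers per thread    : %d\n",  attr.numRegs);
    printf("static shared mem/block : %zu\n", attr.sharedSizeBytes);
    printf("local mem per thread    : %zu\n", attr.localSizeBytes);
}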
