
Why does CUDA round up the number of registers used by a thread?

I'm profiling a kernel that uses 25 registers per thread and 3568 bytes of shared memory per block on a GTX 480. The kernel is configured to launch 16x16 threads, and the cache preference is set to shared.

According to the specifications of the GTX 480, the device has 32768 registers per SM, so it should be possible to have 25 regs x 256 threads per block x 5 blocks per SM running concurrently.

However, the Compute Visual Profiler and the CUDA Occupancy Calculator report that only 4 blocks will be active per SM. I was wondering why only 4 blocks are active and not 5, as I expected.

The reason I found is that CUDA rounds the number of registers used up to 26, in which case the number of active blocks is 4.

Why does CUDA round up the number of registers? With 25 registers per thread and 256 threads per block it would be possible to have up to 5 blocks per SM, which would obviously be an advantage.
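For reference, a minimal host-side sketch of that expectation, assuming registers were allocated strictly per thread (the 25 registers and the 16x16 block size are the values reported by ptxas below):

/* Naive expectation: strictly per-thread register allocation. */
#include <stdio.h>

int main(void)
{
    const int regs_per_thread   = 25;       /* from ptxas output */
    const int threads_per_block = 16 * 16;  /* 256 */
    const int regs_per_sm       = 32768;    /* GTX 480, compute capability 2.0 */

    int regs_per_block = regs_per_thread * threads_per_block;      /* 6400 */
    printf("blocks per SM: %d\n", regs_per_sm / regs_per_block);   /* 5 */
    return 0;
}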

Environment setup:

Device 0: "GeForce GTX 480"
CUDA Driver Version / Runtime Version          5.0 / 4.0
ptxas info: Compiling entry function '_Z13kernellS_PiS0_iiS0_' for 'sm_20'
ptxas info: Used 25 registers, 3568+0 bytes smem, 80 bytes cmem[0], 16 bytes cmem[2]
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
kernel config: 16x16 threads per block
kernel config: cudaFuncCachePreferShared

You haven't interpreted what is happening correctly. There is no rounding of the number of registers per thread happening here; there is rounding of the number of registers per warp.

Your GPU allocates registers on a per-warp basis, with a register "page size" of 64 registers (note I use that term loosely; I am not privy to the precise register file design). In your case a warp requires 25 * 32 = 800 registers, which must be rounded up to the nearest multiple of that 64-register "page size", giving 832 registers per warp. Each block contains 8 warps (256 threads), so each block requires 6656 registers. The maximum number of blocks per SM for this kernel is then 32768 / 6656 rounded down to the nearest integer, i.e. 4 blocks per SM rather than the 5 you expect.
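The same calculation written out as a small host-side sketch; the 64-register per-warp allocation unit is the "page size" described above (an assumption consistent with the CC 2.0 occupancy calculator, not a documented hardware value):

/* Per-warp register allocation with a 64-register granularity. */
#include <stdio.h>

int main(void)
{
    const int regs_per_thread = 25;
    const int warp_size       = 32;
    const int warps_per_block = 256 / warp_size;   /* 8 */
    const int regs_per_sm     = 32768;
    const int granularity     = 64;                /* per-warp allocation unit (assumed) */

    int regs_per_warp  = regs_per_thread * warp_size;                /* 800 */
    /* round up to the nearest multiple of the allocation granularity */
    regs_per_warp      = ((regs_per_warp + granularity - 1) / granularity) * granularity;  /* 832 */
    int regs_per_block = regs_per_warp * warps_per_block;            /* 6656 */
    printf("blocks per SM: %d\n", regs_per_sm / regs_per_block);     /* 4 */
    return 0;
}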

So the very short answer is that register file allocation granularity and page size dictate how many blocks can run per SM in this case.
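As an aside, on toolkits much newer than the 4.0 runtime used here (CUDA 6.5 and later), the runtime's own answer can be queried with the occupancy API. A minimal sketch, using a hypothetical stand-in kernel my_kernel since the real kernel isn't shown; run on a GTX 480 with a kernel that really uses 25 registers per thread and only static shared memory, it should report the same 4 blocks per SM:

// Query active blocks per SM for a given kernel and block size (CUDA 6.5+).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(int *out)   /* hypothetical placeholder kernel */
{
    if (out) out[threadIdx.x] = threadIdx.x;
}

int main()
{
    int max_blocks = 0;
    /* 256 threads per block (16x16); 0 bytes of dynamic shared memory,
       since the 3568 bytes reported by ptxas are statically allocated. */
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel, 256, 0);
    printf("active blocks per SM: %d\n", max_blocks);
    return 0;
}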
