
Cuda registers per thread

Original post 2013-07-09 16:24:17 5 1 cuda/ local-storage/ profiler

If I understand correctly, devices of compute capability 2.x have a limit of 63 registers per thread. Do you know what the register limit per thread is for devices of compute capability 1.3?

I have a big kernel which I'm testing on a GTX260. I'm pretty sure I'm using a lot of registers, since the kernel is very complex and needs a lot of local variables. According to the Cuda profiler my register usage is 63 (Static Smem is 68, although I'm not sure what that means, and dynamic Smem is 0). I'm pretty sure I have more than 63 local variables, so I figured the compiler is either reusing registers or spilling them into local memory.

Now, I thought devices of compute capability 1.3 had a higher limit of registers per thread than the 2.x devices. My guess was that the compiler was choosing the 63 limit because I'm using blocks of 256 threads, in which case 256*63 is 16128 while 256*64 is 16384, which is the total number of registers for an SM of this device. So I assumed that if I lowered the number of threads per block I could increase the number of registers in use, and I ran the kernel with blocks of 192 threads. But again the profiler shows 63 registers, even though 63*192 is 12096 and 64*192 is 12288, which is well inside the 16384 limit of the SM.

So any idea why the compiler is still limiting itself to 63 registers? Could it all be down to register reuse, or is it still spilling registers?

1 solution

Max registers per thread is documented here.

It is 63 for cc 2.x and 3.0, 128 for cc 1.x and 255 for cc 3.5.

The compiler may have decided that 63 registers is enough and has no use for additional registers. Registers can be reused, so having a lot of local variables doesn't necessarily mean that the register count per thread has to be high.

My suggestion would be to use the nvcc -maxrregcount option to specify various limits, and then use the -Xptxas -v option to have the compiler tell you how many registers it is using when it creates the PTX.