CUDA double precision and number of registers per thread

I am getting an error while executing the kernel:

too many resources requested for launch

I searched online for hints on this error message, which suggest it occurs when a kernel uses more registers than the limit the GPU allows per multiprocessor. Device query results are as follows:

Device 0: "GeForce GTX 470"
CUDA Driver Version / Runtime Version          5.0 / 5.0
CUDA Capability Major/Minor version number:    2.0
Total amount of global memory:                 1279 MBytes (1341325312 bytes)
(14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
GPU Clock rate:                                1215 MHz (1.22 GHz)
Memory Clock rate:                             1674 Mhz
Memory Bus Width:                              320-bit
L2 Cache Size:                                 655360 bytes
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 32768
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Maximum sizes of each dimension of a block:    1024 x 1024 x 64
Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
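
(For reference, these limits can also be read programmatically. A minimal sketch using the standard cudaGetDeviceProperties call, printing just the fields relevant to this question:)

#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: read the per-block limits that matter for this error
// directly from the runtime (the same numbers deviceQuery reports).
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("Registers per block:   %d\n", prop.regsPerBlock);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}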

Update: Robert Crovella remarked that he had no problems running the code, so I am pasting the complete code snippet here for execution.

The complete code looks like this:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>

// checkCudaErrors (error-checking macro) is shown in the P.S. at the end of this question

__global__ void calc_params(double *d_result_array, int total_threads) {

    // Global thread index, with a bounds check so the last (partial)
    // block does not write past the end of the array.
    int thread_id = threadIdx.x + (blockIdx.x * blockDim.x);
    if (thread_id < total_threads) {
        d_result_array[thread_id] = 1 / d_result_array[thread_id];
    }
}

void calculate() {

    double *h_array;
    double *d_array;

    size_t array_size = pow((double)31, 2) * 2 * 10;

    h_array = (double *)malloc(array_size * sizeof(double));
    cudaMalloc((void **)&d_array, array_size * sizeof(double));

    for (int i = 0; i < array_size; i++) {
        h_array[i] = i;
    }

    cudaMemcpy(d_array, h_array, array_size * sizeof(double), cudaMemcpyHostToDevice);

    int BLOCK_SIZE = 1024;
    // Round up so a partial block covers the remainder of the array.
    int NUM_OF_BLOCKS = (array_size / BLOCK_SIZE) + ((array_size % BLOCK_SIZE) ? 1 : 0);

    calc_params<<<NUM_OF_BLOCKS, BLOCK_SIZE>>>(d_array, array_size);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());

    cudaFree(d_array);
    free(h_array);
}

When I execute this code, I get the error: too many resources requested for launch

However, if I replace the inverse statement in the kernel (i.e. d_result_array[thread_id] = 1 / d_result_array[thread_id]) with a plain scaling statement (i.e. d_result_array[thread_id] = d_result_array[thread_id] * 200), the kernel works perfectly.

Why? Is there any alternative to this (other than using a smaller block size)? If that is the only solution, how can I determine what block size will work?

Regards,

P.S. For those who might want to know what checkCudaErrors is:

#define checkCudaErrors(val) check( (val), #val, __FILE__, __LINE__)

template<typename T>
void check(T err, const char* const func, const char* const file, const int line) {
  if (err != cudaSuccess) {
    std::cerr << "CUDA error at: " << file << ":" << line << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
    exit(1);
  }
}
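
A usage sketch: the same macro can wrap any CUDA runtime call so that failures surface with the file and line, for example the allocation and copy in calculate():

checkCudaErrors(cudaMalloc((void **)&d_array, array_size * sizeof(double)));
checkCudaErrors(cudaMemcpy(d_array, h_array, array_size * sizeof(double), cudaMemcpyHostToDevice));
checkCudaErrors(cudaGetLastError());   // after the kernel launch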

Build and OS Information

Build of configuration Debug for project TEST

make all 
Building file: ../test_param.cu
Invoking: NVCC Compiler
nvcc -G -g -O0 -gencode arch=compute_20,code=sm_20 -odir "" -M -o "test_param.d" "../test_param.cu"
nvcc --compile -G -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20  -x cu -o  "test_param.o" "../test_param.cu"
Finished building: ../test_param.cu

Building target: TEST
Invoking: NVCC Linker
nvcc  -link -o  "TEST"  ./test_param.o   
Finished building target: TEST

Operating System

Ubuntu Lucid (10.04.4) 64bit
Linux paris 2.6.32-46-generic #105-Ubuntu SMP Fri Mar 1 00:04:17 UTC 2013 x86_64 GNU/Linux

Error I receive

CUDA error at: ../test_param.cu:42
too many resources requested for launch cudaGetLastError()

This seems to be an artifact of the compiler. The problem seems to be the register usage, which you can observe by passing the -Xptxas -v option on the nvcc command line. For some reason the -G version of the code uses quite a bit more registers (per thread) than the regular code. You have a few options:

  1. Don't use the -G switch. This switch should only be used for debug purposes anyway, as it generates code that may run slower than without the -G switch.
  2. If you want to use the -G switch, then reduce the number of threads per block. For the example in this case, I was able to get it to run with 768 threads per block or less.
  3. Instruct the compiler to use fewer registers per thread. You can do this with the -maxrregcount switch, such as:

     nvcc -Xptxas -v -arch=sm_20 -G -maxrregcount=20 -o t145 t145.cu 

The objective in this last case is to have the (registers per thread * threads per block) be less than the max registers per block for the GPU in use. A typical CC 2.0 GPU has a maximum of 32768 registers available per block (which you can discover with the deviceQuery sample).
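
To illustrate that arithmetic, here is a minimal sketch (it assumes the standard cudaGetDeviceProperties and cudaFuncGetAttributes runtime API calls and reuses the calc_params kernel from the question) that reports the compiled kernel's per-thread register count and the largest block size that stays within the per-block register budget:

#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the question, repeated so this sketch compiles on its own.
__global__ void calc_params(double *d_result_array, int total_threads) {
    int thread_id = threadIdx.x + (blockIdx.x * blockDim.x);
    if (thread_id < total_threads)
        d_result_array[thread_id] = 1 / d_result_array[thread_id];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // limits of device 0

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, calc_params);  // attributes of the compiled kernel

    // (registers per thread) * (threads per block) must not exceed regsPerBlock
    int regs_per_thread = attr.numRegs > 0 ? attr.numRegs : 1;
    int max_threads = prop.regsPerBlock / regs_per_thread;
    max_threads = (max_threads / prop.warpSize) * prop.warpSize;  // whole warps only
    if (max_threads > attr.maxThreadsPerBlock)
        max_threads = attr.maxThreadsPerBlock;

    printf("Registers per thread:         %d\n", attr.numRegs);
    printf("Registers per block (limit):  %d\n", prop.regsPerBlock);
    printf("Largest block size that fits: %d threads\n", max_threads);
    return 0;
}

Note that the register count reported this way reflects the flags the file was compiled with, so building with and without -G (or with -maxrregcount) will give different answers.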
