CUDA double precision and the number of registers per thread

Question

I get the following error when executing a kernel:

too many resources requested for launch

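I searched online for hints about this error message. They suggest it occurs when a launch uses more registers than the limit the GPU specifies for each multiprocessor. The device query results are as follows: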

Device 0: "GeForce GTX 470"
CUDA Driver Version / Runtime Version          5.0 / 5.0
CUDA Capability Major/Minor version number:    2.0
Total amount of global memory:                 1279 MBytes (1341325312 bytes)
(14) Multiprocessors x ( 32) CUDA Cores/MP:    448 CUDA Cores
GPU Clock rate:                                1215 MHz (1.22 GHz)
Memory Clock rate:                             1674 Mhz
Memory Bus Width:                              320-bit
L2 Cache Size:                                 655360 bytes
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 32768
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Maximum sizes of each dimension of a block:    1024 x 1024 x 64
Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535

Update: Robert Crovella said he had no problem running the code, so I am pasting the full code snippet here for execution.

The complete code is as follows:

#include <cmath>    // pow
#include <cstdlib>  // malloc, free

__global__ void calc_params(double *d_result_array, int total_threads) {

        // Global index across all blocks of the 1D grid.
        int thread_id = blockIdx.x * blockDim.x + threadIdx.x;

        // The last block may be only partially used, so guard the access.
        if (thread_id < total_threads) {
            d_result_array[thread_id] = 1 / d_result_array[thread_id];
        }

 }

  void calculate() {

     double *h_array;
     double *d_array;

     size_t array_size = pow((double)31, 2) * 2 * 10;

     h_array = (double *)malloc(array_size * sizeof(double));
     cudaMalloc((void **)&d_array, array_size * sizeof(double));

     for (size_t i = 0; i < array_size; i++) {
        h_array[i] = i;
     }

     cudaMemcpy(d_array, h_array, array_size * sizeof(double), cudaMemcpyHostToDevice);

     int BLOCK_SIZE = 1024;
     // Round up. The original "(a / b) + (a % b)?1:0" always evaluated to 1
     // because + binds tighter than ?:.
     int NUM_OF_BLOCKS = (int)((array_size + BLOCK_SIZE - 1) / BLOCK_SIZE);

     calc_params<<<NUM_OF_BLOCKS, BLOCK_SIZE>>>(d_array, (int)array_size);
     cudaDeviceSynchronize();
     checkCudaErrors(cudaGetLastError());

     cudaFree(d_array);
     free(h_array);

}

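When I execute this code, I get the error "too many resources requested for launch".

If, instead of the inverse statement in the kernel
(i.e. d_result_array[thread_id] = 1 / d_result_array[thread_id]),
I use a simple multiplication, it works perfectly
(i.e. d_result_array[thread_id] = d_result_array[thread_id] * 200).

Why? Is there any alternative, other than using a smaller block size? If that is the only solution, how would I know which block size will work?

Regards,

PS: For those who may wonder what checkCudaErrors is: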

#include <cstdlib>   // exit
#include <iostream>  // std::cerr

#define checkCudaErrors(val) check( (val), #val, __FILE__, __LINE__)

template<typename T>
void check(T err, const char* const func, const char* const file, const int line) {
  if (err != cudaSuccess) {
    std::cerr << "CUDA error at: " << file << ":" << line << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
    exit(1);
  }
}

Build and OS information

Build of configuration Debug for project TEST

make all 
Building file: ../test_param.cu
Invoking: NVCC Compiler
nvcc -G -g -O0 -gencode arch=compute_20,code=sm_20 -odir "" -M -o "test_param.d" "../test_param.cu"
nvcc --compile -G -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20  -x cu -o  "test_param.o" "../test_param.cu"
Finished building: ../test_param.cu

Building target: TEST
Invoking: NVCC Linker
nvcc  -link -o  "TEST"  ./test_param.o   
Finished building target: TEST

Operating system

Ubuntu Lucid (10.04.4) 64bit
Linux paris 2.6.32-46-generic #105-Ubuntu SMP Fri Mar 1 00:04:17 UTC 2013 x86_64 GNU/Linux

The error I get:

CUDA error at: ../test_param.cu:42
too many resources requested for launch cudaGetLastError()

Answer 1

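This appears to be a compiler artifact. The problem seems to be register usage, which you can observe by passing the -Xptxas -v option on the nvcc command line. For some reason, the -G build of the code uses considerably more registers per thread than the regular build. You have a few options:

1. Don't use the -G switch. It should only be used for debugging anyway, because the code it generates may run slower than code built without it.
2. If you want to keep the -G switch, reduce the number of threads per block. For the example in this case, I was able to get it to run with 768 threads per block or fewer.
3. Instruct the compiler to use fewer registers per thread. You can do this with the -maxrregcount switch, for example: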
```
 nvcc -Xptxas -v -arch=sm_20 -G -maxrregcount=20 -o t145 t145.cu 
```
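The per-thread register count that -Xptxas -v prints at compile time can also be read at run time. The sketch below is one way to do that, assuming the kernel from the question (written here with a global index and a bounds check). `cudaFuncGetAttributes` and `cudaGetDeviceProperties` are CUDA runtime calls; the `main` wrapper and the printed labels are illustrative only and not part of the original answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the question, with a global index and a bounds check.
__global__ void calc_params(double *d_result_array, int total_threads) {
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id < total_threads) {
        d_result_array[thread_id] = 1 / d_result_array[thread_id];
    }
}

int main() {
    // Per-kernel resource usage as seen by the runtime.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, calc_params);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Per-device limits (device 0 is an assumption for this sketch).
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread (this kernel):   %d\n", attr.numRegs);
    printf("registers per block (device limit):   %d\n", prop.regsPerBlock);
    printf("max threads per block (this kernel):  %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

Build it with the same flags as the failing binary (for example with -G), since the register count depends on the compile options. The `maxThreadsPerBlock` field reports the largest block size the runtime will accept for this kernel on the current device, which is one way to find a block size that works.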

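The goal of limiting the register count is to make (registers per thread × threads per block) smaller than the maximum number of registers per block for the GPU you are using. A typical CC 2.0 GPU has at most 32768 registers per block (you can find this value with the deviceQuery sample).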
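The register budget can be worked out with simple integer arithmetic. The helper functions below are a minimal host-side sketch, not part of the original answer. The function names are hypothetical, and the calculation ignores the hardware's register allocation granularity, so the real limit may be somewhat lower.

```cpp
#include <algorithm>

// Largest register count per thread that still lets a block of
// threads_per_block threads fit in the per-block register limit.
int max_regs_per_thread(int regs_per_block, int threads_per_block) {
    return regs_per_block / threads_per_block;
}

// Largest block size (a whole number of warps) whose total register
// demand fits in the per-block register limit, capped at the device's
// maximum threads per block.
int max_block_size(int regs_per_thread, int regs_per_block,
                   int warp_size, int hw_max_threads) {
    if (regs_per_thread <= 0) {
        return hw_max_threads;  // nothing to constrain
    }
    int by_regs = regs_per_block / regs_per_thread;  // threads that fit
    by_regs = (by_regs / warp_size) * warp_size;      // round down to whole warps
    return std::min(by_regs, hw_max_threads);
}
```

With the limits from the device query above (32768 registers per block, 1024 threads per block, warp size 32), `max_regs_per_thread(32768, 1024)` gives 32, so the `-maxrregcount=20` setting in the command above leaves room for a full 1024-thread block. As a purely hypothetical illustration, a kernel that needed 42 registers per thread would be limited to `max_block_size(42, 32768, 32, 1024)` = 768 threads per block.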

Answer 1
2, accepted, 2013-05-03 23:18:56
