
CUDA max number of blocks and allocation management

I'm writing a CUDA kernel and I have to execute it on this device:

name: GeForce GTX 480
CUDA capability: 2.0
Total global mem:  1610285056
Total constant Mem:  65536
Shared mem per mp:  49152
Registers per mp:  32768
Threads in warp:  32
Max threads per block:  1024
Max thread dimensions:  (1024, 1024, 64)
Max grid dimensions:  (65535, 65535, 65535)

The kernel, in minimal form, is:

__global__ void CUDAvegas( ... )
{
  devParameters p;
  extern __shared__ double shared[];
  int width = Ndim * Nbins;
  int ltid = p.lId;
  while (ltid < 2 * Ndim) {
    shared[ltid + 2 * width] = ltid;
    ltid += p.lOffset; // offset inside a block
  }
  __syncthreads();
  din2Vec<double> lxp(Ndim, Nbins);

  __syncthreads();
  for (int i = 0; i < Ndim; i++) {
    for (int j = 0; j < Nbins; j++) {
      lxp.v[i][j] = shared[i * Nbins + j];
    }
  }
} // end kernel

where Ndim = 2, Nbins = 128, devParameters is a class whose method p.lId returns the local thread id (inside a block), and din2Vec is a class that creates a vector of dimension Ndim*Nbins with the new operator (its destructor performs the corresponding delete[]). The nvcc output is:
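The din2Vec source isn't shown, but from the description (per-thread new in the constructor, matching delete[] in the destructor) it presumably looks something like the following sketch; every thread that constructs one draws its storage from the device-side malloc heap:

```cuda
// Hypothetical reconstruction of din2Vec from the description above:
// a per-thread 2-D array of dim x bins elements, allocated with
// device-side new and released with delete[] in the destructor.
template <typename T>
struct din2Vec
{
    T **v;
    int dim;

    __device__ din2Vec(int dim_, int bins) : dim(dim_)
    {
        v = new T*[dim];           // draws from the device malloc heap
        for (int i = 0; i < dim; i++)
            v[i] = new T[bins];    // one more heap allocation per row
    }

    __device__ ~din2Vec()
    {
        for (int i = 0; i < dim; i++)
            delete[] v[i];
        delete[] v;
    }
};
```

If this matches the real class, each of the 64 threads in a block performs Ndim+1 heap allocations. The device malloc heap is a fixed-size pool (8 MB by default) that does not scale with the grid size, so its consumption grows with the number of resident blocks.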

nvcc -arch=sm_20   --ptxas-options=-v   file.cu -o file.x
ptxas info    : Compiling entry function '_Z9CUDAvegas4LockidiiPdS0_S0_P7sumAccuP17curandStateXORWOWS0_i' for 'sm_20'
ptxas info    : Function properties for  _Z9CUDAvegas4LockidiiPdS0_S0_P7sumAccuP17curandStateXORWOWS0_i
               0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 22 registers, 116 bytes cmem[0], 51200 bytes cmem[2]

The number of threads is compatible with the multiprocessor limits: max shared memory, max registers per thread and per MP, and warps per MP. If I launch 64 threads x 30 blocks (shared memory per block is 4128 bytes), everything is fine, but if I use more than 30 blocks I obtain the error:

cudaCheckError() failed at file.cu:508 : unspecified launch failure
========= Invalid __global__ read of size 8
=========     at 0x000015d0 in CUDAvegas
=========     by thread (0,0,0) in block (1,0,0)
=========     Address 0x200ffb428 is out of bounds

I think it's a problem with allocating each thread's memory, but I don't understand what my limit is per MP and for the total number of blocks... Can someone help me or point me to the right topic?

PS: I know the kernel presented does nothing, but it's just to understand my limit problem.

I think the error you receive is self-explanatory. It points out that there is an out-of-bounds global read of a datatype of size 8. The thread responsible for the out-of-bounds read is thread (0,0,0) in block (1,0,0). I suspect the responsible instruction is lxp.v[i][j] = shared[i*Nbins+j]; in the last nested for loop. Probably you allocate an amount of global memory which is not related to the number of blocks you launch, so that when you launch too many blocks you receive such an error.
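If the per-thread allocations really do come from device-side new (an assumption here, since the full din2Vec source isn't shown), the usual remedy is to enlarge the device malloc heap before the first kernel launch and to check the pointer new returns, because a failed device-side allocation yields a pointer that later reads trip over exactly like this. A host-side sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical launch setup: size the device heap for
// blocks * threads * perThreadBytes worth of din2Vec allocations.
int main()
{
    const size_t perThreadBytes = 2 * sizeof(double*) + 2 * 128 * sizeof(double);
    const size_t blocks = 60, threads = 64;
    // Pad generously: the device allocator adds per-allocation overhead.
    const size_t heapBytes = 4 * blocks * threads * perThreadBytes;

    // Must be called before any kernel that uses device-side new/malloc.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    printf("device malloc heap set to %zu bytes\n", actual);

    // ... launch CUDAvegas<<<blocks, threads, 4128>>>( ... ) here ...
    return 0;
}
```

Inside din2Vec's constructor it would also be worth asserting that each new returns a non-null pointer, which distinguishes heap exhaustion from a plain indexing bug.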
