
CUDA Kernel Fails to Launch

Here is my code. I have an array of (x, y) pairs, and for each coordinate I want to find the farthest point from it.

// Error-checking helper: prints the CUDA error string with file and line, and aborts on failure
#define GPUERRCHK(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

// Euclidean distance between (x1,y1) and (x2,y2); sqrtf keeps the arithmetic in single precision
__device__ float computeDist( float x1, float y1, float x2, float y2 )
{
    float delx = x2 - x1;
    float dely = y2 - y1;
    return sqrtf( delx*delx + dely*dely );
}

__global__ void kernel( float * x, float * y, float * dev_dist_sum, int N )
{
    int tid = blockIdx.x*gridDim.x + threadIdx.x;
    float a = x[tid];  //............(alpha)
    float b = y[tid];  //............(beta)
    if( tid < N )
    {
        float maxDist = -1;
        for( int k=0 ; k<N ; k++ )
        {
            //float dist = computeDist( x[tid], y[tid], x[k], y[k] ); //....(gamma)
            float dist = computeDist( a, b, x[k], y[k] );             //....(delta)
            if( dist > maxDist )
                maxDist = dist;
        }
        dev_dist_sum[tid] = maxDist;
    }
}

int main()
{
    // ...

    kernel<<<(N+31)/32,32>>>( dev_x, dev_y, dev_dist_sum, N );
    GPUERRCHK( cudaPeekAtLastError() );    // reports errors in the launch itself
    GPUERRCHK( cudaDeviceSynchronize() );  // reports faults that occur while the kernel runs

    // ...
}

I have an NVIDIA GeForce 420M and have verified that CUDA works with it on my computer. When I run the above code with N = 50000, the kernel fails to launch with the error message "unspecified launch failure". However, it seems to work fine for a smaller value such as 10000.

Also, if I comment out alpha, beta, and delta (see the markers in the code) and uncomment gamma, the code works even for large values of N such as 50000 or 100000.

I want to use alpha and beta so that each thread reads its own coordinates into thread-local registers once, instead of re-reading them from global memory on every iteration, reducing memory traffic.

How do I fix this issue?
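
The failure is consistent with two things visible in the kernel: the thread id is computed with gridDim.x (the number of blocks) where blockDim.x (the number of threads per block) is needed, and the alpha/beta loads execute before the tid < N guard, so out-of-range threads read past the ends of x and y. Below is a minimal sketch of the kernel with both points addressed and the register-caching idea kept; it assumes the rest of the program is unchanged.

__global__ void kernel( float * x, float * y, float * dev_dist_sum, int N )
{
    // block offset uses blockDim.x (threads per block),
    // not gridDim.x (blocks per grid)
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if( tid < N )
    {
        // cache this thread's coordinates in registers (the alpha/beta idea),
        // now safely inside the bounds check
        float a = x[tid];
        float b = y[tid];
        float maxDist = -1;
        for( int k=0 ; k<N ; k++ )
        {
            float dist = computeDist( a, b, x[k], y[k] );
            if( dist > maxDist )
                maxDist = dist;
        }
        dev_dist_sum[tid] = maxDist;
    }
}

With this indexing, alpha and beta still do what was intended: a and b live in registers, so the inner loop reads only x[k] and y[k] from global memory.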

@mkuse: gridDim can be visualized as the 2-D spatial arrangement of thread blocks in the grid, and blockDim as the 3-D spatial arrangement of threads within a block. For instance, dim3 gridDim(2,3,1) means 2 thread blocks in the x direction and 3 thread blocks in the y direction; the maximum for each grid dimension on this hardware is 65535 (2^16 - 1). dim3 blockDim(32,16,1) is at thread granularity: 32 threads in the x direction and 16 in the y direction, 512 threads in total. You can access each thread with a thread id, but since there are multiple blocks you have to build that id from threadIdx together with blockIdx and the blockDim/gridDim sizes.
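
Concretely, the usual pattern for turning those indices into a unique global thread id looks like the sketch below; the 2-D variant matches the dim3 examples above.

// 1-D launch, as in kernel<<<(N+31)/32, 32>>>(...):
int tid = blockIdx.x * blockDim.x + threadIdx.x;  // block offset + position within the block

// 2-D launch, e.g. dim3 gridDim(2,3,1) with dim3 blockDim(32,16,1):
int tx = blockIdx.x * blockDim.x + threadIdx.x;   // global x index
int ty = blockIdx.y * blockDim.y + threadIdx.y;   // global y index
int id = ty * (gridDim.x * blockDim.x) + tx;      // flattened id; gridDim supplies the
                                                  // row width, never the offset within a row

This is also why N = 50000 fails with the original formula: gridDim.x = (50000+31)/32 = 1563, so blockIdx.x*gridDim.x + threadIdx.x reaches 1562*1563 + 31 = 2,441,437, far past the 50000-element arrays. For N = 10000 the overrun still exists but presumably lands in mapped memory, so the kernel only appears to work; and with gamma the guarded reads never go out of bounds, although most entries of dev_dist_sum are then never written because the ids are wrong.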
