"Illegal memory access" when trying to write to a 2D array allocated with cudaMalloc3D

Question

I am trying to allocate a flattened 2D array on the device with cudaMalloc3D and copy data into it, in order to test the performance of cudaMalloc3D. When I try to write to the array from the kernel, I get the error 'an illegal memory access was encountered'. The program runs fine if I only read from the array; the error appears only when I write to it. Any help would be greatly appreciated. Below are my code and the command I use to compile it.

Compile using:

nvcc -O2 -arch sm_20 test.cu

Code: test.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265 
#define NX 8192     /* includes boundary points on both end */
#define NY 4096     /* includes boundary points on both end */
#define NZ 1        /* needed for cudaMalloc3D */

#define N_THREADS_X 16
#define N_THREADS_Y 16
#define N_BLOCKS_X NX/N_THREADS_X 
#define N_BLOCKS_Y NY/N_THREADS_Y 

#define LX 4.0    /* length of the domain in x-direction  */
#define LY 2.0    /* length of the domain in y-direction  */
#define dx       (REAL) ( LX/( (REAL) (NX) ) )
#define cSqrd     5.0
#define dt       (REAL) ( 0.4 * dx / sqrt(cSqrd) )
#define FACTOR   ( cSqrd * (dt*dt)/(dx*dx) )

#define IC  (i + j*NX)       /* (i,j)   */
#define IM1 (i + j*NX - 1)   /* (i-1,j) */
#define IP1 (i + j*NX + 1)   /* (i+1,j) */
#define JM1 (i + (j-1)*NX)   /* (i,j-1) */
#define JP1 (i + (j+1)*NX)   /* (i,j+1) */


// Macro for checking CUDA errors following a CUDA launch or API call
#define cudaCheckError() {\
  cudaError_t e = cudaGetLastError();\
  if( e != cudaSuccess ) {\
    printf("\nCuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e));\
    exit(EXIT_FAILURE);\
  }\
}

typedef double REAL;
typedef int    INT;


void meshGrid ( REAL *x, REAL *y )
{

  INT i,j;
  REAL a;
  for (j=0; j<NY; j++) {
    a = dx * ( (REAL) j );
    for (i=0; i<NX; i++) {
      x[IC] =  dx * ( (REAL) i );
      y[IC] = a;
    }
  }
}


void initWave ( REAL *u, REAL *uold, REAL *x, REAL *y )
{                    
  INT i,j;
  for (j=1; j<NY-1; j++) {
    for (i=1; i<NX-1; i++) {
      u[IC] =  0.1 * (4.0*x[IC]-x[IC]*x[IC]) * ( 2.0*y[IC] - y[IC]*y[IC] );
    }
  }

  for (j=1; j<NY-1; j++) {
    for (i=1; i<NX-1; i++) {
      uold[IC] = u[IC] + 0.5*FACTOR*( u[IP1] + u[IM1] + u[JP1] + u[JM1] - 4.0*u[IC] );
    }
  }
}


__global__ void solveWaveGPU ( cudaPitchedPtr uold, cudaPitchedPtr u, cudaPitchedPtr unew )
{

 INT i,j;

 i = blockIdx.x*blockDim.x + threadIdx.x;
 j = blockIdx.y*blockDim.y + threadIdx.y;

 if (i>0 && i < (NX-1) && j>0 && j < (NY-1) ) {

  char *unewPtr  = (char *) unew.ptr;
  REAL *unew_row = (REAL *) (unewPtr + i * unew.pitch);

  REAL tmp = unew_row[j]; // no error on this line
  unew_row[j] = 1.2; // this is where I get the error
 }

}


INT main(INT argc, char *argv[])
{

  INT nTimeSteps = 10;  

  // pointers for the host side
  REAL *unew, *u, *uold, *uFinal, *x, *y;

  // allocate memory on the host
  unew        = (REAL *)calloc(NX*NY,sizeof(REAL));
  u           = (REAL *)calloc(NX*NY,sizeof(REAL));
  uold        = (REAL *)calloc(NX*NY,sizeof(REAL));
  uFinal      = (REAL *)calloc(NX*NY,sizeof(REAL));
  x           = (REAL *)calloc(NX*NY,sizeof(REAL));
  y           = (REAL *)calloc(NX*NY,sizeof(REAL));


  // pointer for the device side
  size_t pitch = NX * sizeof(REAL);
  cudaPitchedPtr  d_u, d_uold, d_unew, d_tmp;
  cudaExtent myExtent = make_cudaExtent(pitch, NY, NZ);

  // allocate 3D memory on the device
  cudaMalloc3D( &d_u, myExtent );    cudaCheckError();
  cudaMalloc3D( &d_uold, myExtent ); cudaCheckError();
  cudaMalloc3D( &d_unew, myExtent ); cudaCheckError();


  // initialize grid and wave
  meshGrid( x, y );
  initWave( u, uold, x, y );


  // copy host memory to 3D device memory
  cudaMemcpy3DParms cpy3D = { 0 };
  cpy3D.kind = cudaMemcpyHostToDevice;

  // copying u to d_u
  cpy3D.srcPtr = make_cudaPitchedPtr(u, pitch, NX, NY);
  cpy3D.dstPtr = d_u;
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  

  // copying uold to d_uold
  cpy3D.srcPtr = make_cudaPitchedPtr(uold, pitch, NX, NY);
  cpy3D.dstPtr = d_uold;
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  


  //  set up the GPU grid/block model
  dim3 dimGrid  ( N_BLOCKS_X , N_BLOCKS_Y  );
  dim3 dimBlock ( N_THREADS_X, N_THREADS_Y );

  for ( INT n = 1; n < nTimeSteps + 1; n++ ) {
    solveWaveGPU <<< dimGrid, dimBlock >>> ( d_uold, d_u, d_unew );
    cudaThreadSynchronize();
    cudaCheckError();

    d_tmp  = d_uold;
    d_uold = d_u;
    d_u    = d_unew;
    d_unew = d_tmp;
  }

  // copy the memory back to host
  cpy3D.kind = cudaMemcpyDeviceToHost;

  // copying d_unew to uFinal
  cpy3D.srcPtr = d_unew;
  cpy3D.dstPtr = make_cudaPitchedPtr(uFinal, pitch, NX, NY);
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  

  free(u);    cudaFree(d_u.ptr);
  free(unew); cudaFree(d_unew.ptr);
  free(uold); cudaFree(d_uold.ptr);

  free(uFinal); free(x); free(y);

  return EXIT_SUCCESS;
}

Answer 1

The reason the error doesn't occur on this line:

REAL tmp = unew_row[j]; // no error on this line

is that the compiler optimizes that line out. It doesn't do anything useful, so the compiler eliminates it completely. The compiler warning:

xxx.cu(87): warning: variable "tmp" was declared but never referenced

is a hint to that effect.

Your code is very nearly correct. The issue is here:

REAL *unew_row = (REAL *) (unewPtr + i * unew.pitch);

It should be:

REAL *unew_row = (REAL *) (unewPtr + j * unew.pitch);

The i variable in your kernel is the width (i.e. X) dimension. The j variable is the height (i.e. Y) dimension.

The height index determines which row you are on, so the row pitch should be multiplied by the height index, i.e. j, not i. With your dimensions, i can reach NX-2 = 8190 while the allocation has only NY = 4096 rows, so i * unew.pitch points far past the end of the allocation.
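
The addressing rule above can be sketched as a small helper. This is only an illustration, not part of the original code: the name pitchedElement and the HOST_DEVICE macro are hypothetical, and the code is written so that it compiles either as plain C++ or inside a .cu file.

```cpp
#include <cstddef>

// Expands to CUDA qualifiers under nvcc and to nothing under a host-only compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

typedef double REAL;

// Hypothetical helper: address of element (row, col) in a pitched allocation.
//   base  : start of the allocation (cudaPitchedPtr::ptr inside a kernel)
//   pitch : bytes per row including padding (cudaPitchedPtr::pitch)
//   row   : height (Y) index, the one that is multiplied by the pitch
//   col   : width (X) index within that row
HOST_DEVICE inline REAL *pitchedElement(void *base, std::size_t pitch,
                                        std::size_t row, std::size_t col)
{
    // The pitch is measured in bytes, so step to the row through a char pointer.
    char *rowStart = static_cast<char *>(base) + row * pitch;
    // Within the row, ordinary element indexing applies.
    return reinterpret_cast<REAL *>(rowStart) + col;
}
```

In the kernel from the question, the write would then read *pitchedElement(unew.ptr, unew.pitch, j, i) = 1.2; with j selecting the row and i the column.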

Similarly, although it's not the source of the failure for your particular dimensions, this code may not be what you intended either:

REAL tmp = unew_row[j]; // no error on this line
unew_row[j] = 1.2; // this is where I get the error

If, for example, you intended to compute the offset to the row and then index into the row (perhaps to set every element in the allocation), then I think you would want to use i, not j, as your final index:

REAL tmp = unew_row[i]; // i selects the column within the row
unew_row[i] = 1.2;

However, for this particular example, this is not the actual source of the illegal memory access.
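
To see why only the pitch multiplier triggers the fault for these dimensions, the largest indices allowed by the kernel's guard can be checked against the allocation on the host. A minimal sketch, assuming the pitch equals NX * sizeof(double) (the pitch actually returned by cudaMalloc3D may be larger, which does not change the conclusion); the function name rowColInBounds is hypothetical:

```cpp
#include <cstddef>

// Hypothetical check: do accesses up to (maxRow, maxCol) stay inside a pitched
// allocation of `rows` rows, each `pitch` bytes long, holding double elements?
inline bool rowColInBounds(std::size_t pitch, std::size_t rows,
                           std::size_t maxRow, std::size_t maxCol)
{
    // The row index must address an existing row.
    const bool rowOk = maxRow < rows;
    // The last element touched must end within the bytes of a single row.
    const bool colOk = (maxCol + 1) * sizeof(double) <= pitch;
    return rowOk && colOk;
}
```

With NX = 8192 and NY = 4096, rowColInBounds(pitch, NY, NX - 2, NY - 2) (row i, column j, as in the question) is false because row 8190 does not exist, while rowColInBounds(pitch, NY, NY - 2, NX - 2) (row j, column i) is true.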

Solution 1
2 Accepted 2015-06-17 19:11:18
