
Wrong results using CUDA streams and cudaMemcpyAsync, which become correct after adding cudaDeviceSynchronize

I'm developing a CUDA matrix multiplication, and I made some modifications to observe how they affect performance.

I'm trying to observe the behavior of a simple matrix multiplication kernel (and I'm measuring the changes in GPU event time). I'm testing it under two specific, different conditions:

  • I have a number of matrices (say matN) for each of A, B and C. I transfer (H2D) one A matrix and one B matrix at a time, multiply them, and transfer back (D2H) one C;

  • I have matN matrices for each of A, B and C, but I transfer more than one (say chunk) matrix at a time for A and for B, perform exactly chunk multiplications, and transfer back chunk result matrices C.

In the first case (chunk = 1) everything works as expected, but in the second case (chunk > 1) some of the resulting Cs are correct, while others are wrong.

But if I put a cudaDeviceSynchronize() after the cudaMemcpyAsync, all the results I get are correct.

Here's the part of the code doing what I've just described:


/**** main.cpp ****/

    int chunk = matN/iters;    
    #ifdef LOWPAR
        GRIDx= 1;
        GRIDy= 1;
        label="LOW";
    #else
       int sizeX = M;
       int sizeY = N;
       GRIDx = ceil((sizeX)/BLOCK);
       GRIDy = ceil((sizeY)/BLOCK);
       label="";
    #endif

    const int bytesA = M*K*sizeof(float);
    const int bytesB = K*N*sizeof(float);
    const int bytesC = M*N*sizeof(float);

    //device mem allocation
    float *Ad, *Bd, *Cd;
    gpuErrchk( cudaMalloc((void **)&Ad, bytesA*chunk) );
    gpuErrchk( cudaMalloc((void **)&Bd, bytesB*chunk) );
    gpuErrchk( cudaMalloc((void **)&Cd, bytesC*chunk) );
    //host pinned mem allocation
    float *A, *B, *C;
    gpuErrchk( cudaMallocHost((void **)&A, bytesA*matN) );
    gpuErrchk( cudaMallocHost((void **)&B, bytesB*matN) );
    gpuErrchk( cudaMallocHost((void **)&C, bytesC*matN) );

    //host data init
    for(int i=0; i<matN; ++i){
        randomMatrix(M, K, A+(i*M*K));
        randomMatrix(K, N, B+(i*K*N));
    } 

    //event start
    createAndStartEvent(&startEvent, &stopEvent);

    if (square)
    {          
        label += "SQUARE";
        int size = N*N;
        for (int i = 0; i < iters; ++i) { 
            int j = i%nStream;            
            int idx = i*size*chunk;
            newSquareMatMulKer(A+idx, B+idx, C+idx, Ad, Bd, Cd, N, chunk, stream[j]); 
        }
    }
    else {
        ...
    } 

    msTot = endEvent(&startEvent, &stopEvent);
    #ifdef MEASURES          
        printMeasures(square, label, msTot, millis.count(), matN, iters, devId);
    #else
        float *_A, *_B, *_C, *tmpC;
        tmpC = (float *)calloc(1,bytesC*chunk);
        for (int s=0; s<matN; ++s)
        {
            _A = A+(s*M*K);
            _B = B+(s*K*N);
            _C = C+(s*M*N);
            memset(tmpC, 0, bytesC*chunk);

            hostMatMul(_A, _B, tmpC, M, K, N);
            checkMatEquality(_C, tmpC, M, N);
        }   
    #endif


/**** matmul.cu ****/

__global__ void squareMatMulKernel(float* A, float* B, float* C, int N, int chunk) {

    int ROW = blockIdx.x*blockDim.x+threadIdx.x;
    int COL = blockIdx.y*blockDim.y+threadIdx.y;


    if (ROW<N && COL<N) {
        int size=N*N;
        int offs = 0;
        float tmpSum=0.0f;

        for (int s=0; s<chunk; ++s)
        {
            offs = s*size;
            tmpSum = 0.0f;

            for (int i = 0; i < N; ++i) {
                tmpSum += A[offs+(ROW*N)+i] * B[offs+(i*N)+COL];
            }

            C[offs+(ROW*N)+COL] = tmpSum;
        }
    }
    return ;
}




void newSquareMatMulKer(float *A, float *B, float *C, float *Ad, float *Bd, float *Cd, 
            int n, int chunk, cudaStream_t strm)
{
    int size = n*n;
    int bytesMat = size*sizeof(float);

    dim3 dimBlock(BLOCK,BLOCK,1);
    dim3 dimGrid(GRIDx, GRIDy,1); 

    gpuErrchk( cudaMemcpyAsync(Ad, A, bytesMat*chunk, cudaMemcpyHostToDevice, strm) );    
    gpuErrchk( cudaMemcpyAsync(Bd, B, bytesMat*chunk, cudaMemcpyHostToDevice, strm) );   

    #ifdef LOWPAR
        squareMatMulGridStrideKer<<<dimGrid, dimBlock, 0, strm>>>(Ad, Bd, Cd, n, chunk);
    #else
        squareMatMulKernel<<<dimGrid, dimBlock, 0, strm>>>(Ad, Bd, Cd, n, chunk);
    #endif

    gpuErrchk( cudaMemcpyAsync( C, Cd, bytesMat*chunk, cudaMemcpyDeviceToHost, strm) );

    cudaDeviceSynchronize();  // <--- adding this makes all the results correct
}


I tried to debug with cuda-gdb, but nothing strange showed up, and gpuErrchk doesn't report any error from the CUDA API calls. I also ran the code under memcheck, both with and without cudaDeviceSynchronize, and in both cases I get no errors.

I think I can state it's a synchronization issue, but I can't understand the reason behind it. Can someone spot where I'm going wrong? Other code-style advice is really appreciated too.

If you are using multiple streams, you may overwrite Ad and Bd before using them.

Example with iters = 2 and nStream = 2:

for (int i = 0; i < iters; ++i) { 
  int j = i%nStream;            
  int idx = i*size*chunk;
  newSquareMatMulKer(A+idx, B+idx, C+idx, Ad, Bd, Cd, N, chunk, stream[j]); 
}

From this loop, you will call:

newSquareMatMulKer(A, B, C, Ad, Bd, Cd, N, chunk, stream[0]); // call 0
newSquareMatMulKer(A+idx, B+idx, C+idx, Ad, Bd, Cd, N, chunk, stream[1]); // call 1

As you are using the same memory area on the device for both calls, you may have several synchronization issues:

  • call 1 starts copying A and B onto the device before the squareMatMulKernel of call 0 ends, so you may use incorrect values of A and/or B to compute your first iteration;

  • the squareMatMulKernel of call 1 starts before you retrieve the values of C from call 0, so you may overwrite C with values from call 1. The standalone sketch after this list reproduces the first of these hazards in isolation.
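To make the hazard concrete, here is a standalone sketch (my own illustration, not from the original post) where two streams share one device input buffer, just like Ad and Bd above. The second host-to-device copy may land while the first kernel is still reading the buffer; whether corruption actually shows up depends on timing and hardware.

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: copies in[i] to out[i], but keeps re-reading in[i] for a
// while to widen the window in which the shared input can be overwritten.
__global__ void slowCopy(const float *in, float *out, int n, int spin)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) return;
    volatile const float *vin = in;  // volatile: force a real load on every read
    float v = vin[i];
    for (int k = 0; k < spin; ++k)
        v = vin[i];
    out[i] = v;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n*sizeof(float);

    float *hIn0, *hIn1, *hOut;
    cudaMallocHost((void **)&hIn0, bytes);
    cudaMallocHost((void **)&hIn1, bytes);
    cudaMallocHost((void **)&hOut, bytes);
    for (int i = 0; i < n; ++i) { hIn0[i] = 1.0f; hIn1[i] = 2.0f; }

    float *dIn, *dOut;                 // ONE shared device workspace
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dOut, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // "call 0" on s0 uses dIn ...
    cudaMemcpyAsync(dIn, hIn0, bytes, cudaMemcpyHostToDevice, s0);
    slowCopy<<<(n + 255)/256, 256, 0, s0>>>(dIn, dOut, n, 1000);
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, s0);

    // ... while "call 1" on s1 immediately overwrites the same dIn:
    cudaMemcpyAsync(dIn, hIn1, bytes, cudaMemcpyHostToDevice, s1);

    cudaDeviceSynchronize();

    // Elements equal to 2.0f were read after the overwrite.
    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (hOut[i] != 1.0f);
    printf("elements read from the overwritten buffer: %d\n", bad);
    return 0;  // cleanup omitted for brevity in this sketch
}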

To fix this problem, I see two approaches:

  • Using synchronization as in your example with cudaDeviceSynchronize() (a lighter per-stream variant is sketched right after this list);

  • Allocating more memory on the device side (one workspace per stream), as shown in the code that follows the sketch.
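A minimal sketch of the first approach, assuming it is applied inside the question's newSquareMatMulKer (BLOCK, GRIDx, GRIDy and gpuErrchk come from the question's code): cudaStreamSynchronize(strm) blocks the host only until this stream's work is done, which is enough here because the calls are issued sequentially from the host.

void newSquareMatMulKer(float *A, float *B, float *C,
                        float *Ad, float *Bd, float *Cd,
                        int n, int chunk, cudaStream_t strm)
{
    int bytesMat = n*n*sizeof(float);

    dim3 dimBlock(BLOCK, BLOCK, 1);
    dim3 dimGrid(GRIDx, GRIDy, 1);

    gpuErrchk( cudaMemcpyAsync(Ad, A, bytesMat*chunk, cudaMemcpyHostToDevice, strm) );
    gpuErrchk( cudaMemcpyAsync(Bd, B, bytesMat*chunk, cudaMemcpyHostToDevice, strm) );

    squareMatMulKernel<<<dimGrid, dimBlock, 0, strm>>>(Ad, Bd, Cd, n, chunk);

    gpuErrchk( cudaMemcpyAsync(C, Cd, bytesMat*chunk, cudaMemcpyDeviceToHost, strm) );

    // Wait only for this stream, so the next call can safely reuse Ad/Bd/Cd.
    gpuErrchk( cudaStreamSynchronize(strm) );
}

Note that any host-side blocking here still serializes the calls and forfeits the overlap between streams, which is exactly why the second approach is preferable. For the second approach, the allocation and the loop become: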


//device mem allocation
float *Ad, *Bd, *Cd;
gpuErrchk( cudaMalloc((void **)&Ad, bytesA*chunk*nStream) );
gpuErrchk( cudaMalloc((void **)&Bd, bytesB*chunk*nStream) );
gpuErrchk( cudaMalloc((void **)&Cd, bytesC*chunk*nStream) );

/* code here */

for (int i = 0; i < iters; ++i) { 
  int j = i%nStream;            
  int idx = i*size*chunk;
  int offset_stream = j*size*chunk;
  newSquareMatMulKer(A+idx, B+idx, C+idx, 
    Ad + offset_stream , 
    Bd + offset_stream , 
    Cd + offset_stream , N, chunk, stream[j]); 
}

In this case you don't need any synchronization before the end of the loop.
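One caveat (my addition, not part of the original answer): with per-stream workspaces and no synchronization inside the loop, the host must still wait for every stream's device-to-host copy to finish before reading C, for example before the checkMatEquality pass from the question:

// After the issuing loop, before touching C on the host:
for (int j = 0; j < nStream; ++j)
    gpuErrchk( cudaStreamSynchronize(stream[j]) );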
