
CUDA concurrent kernel launch not working

I'm writing a CUDA program for image processing. The same kernel, processOneChannel, is launched once for each of the RGB channels.

Below I try to assign a stream to each of the three kernel launches so they can be processed concurrently. But nvprof shows they are still launched one after another...

There are two other kernels, one before and one after these three, and I don't want those to run concurrently with anything.

Basically I want the following: separateChannels --> processOneChannel (x3) --> recombineChannels

Please advise what I did wrong..

void kernelLauncher(const ushort4* const h_inputImageRGBA, ushort4* const d_inputImageRGBA,
                    ushort4* const d_outputImageRGBA, const size_t numRows, const size_t numCols,
                    unsigned short *d_redProcessed,
                    unsigned short *d_greenProcessed,
                    unsigned short *d_blueProcessed,
                    unsigned short *d_prand)
{
    int MAXTHREADSx = 512;
    int MAXTHREADSy = 1;
    int nBlockX = numCols / MAXTHREADSx + 1;
    int nBlockY = numRows / MAXTHREADSy + 1;

    const dim3 blockSize(MAXTHREADSx, MAXTHREADSy, 1);
    const dim3 gridSize(nBlockX, nBlockY, 1);

    // cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

    int nstreams = 5;
    cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));

    for (int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaStreamCreateWithFlags(&(streams[i]), cudaStreamNonBlocking));
    }

    separateChannels<<<gridSize, blockSize>>>(d_inputImageRGBA,
                                              (int)numRows,
                                              (int)numCols,
                                              d_red,
                                              d_green,
                                              d_blue);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());

    processOneChannel<<<gridSize, blockSize, 0, streams[0]>>>(d_red,
                                                              d_redProcessed,
                                                              (int)numRows, (int)numCols,
                                                              d_filter, d_prand);

    processOneChannel<<<gridSize, blockSize, 0, streams[1]>>>(d_green,
                                                              d_greenProcessed,
                                                              (int)numRows, (int)numCols,
                                                              d_filter, d_prand);

    processOneChannel<<<gridSize, blockSize, 0, streams[2]>>>(d_blue,
                                                              d_blueProcessed,
                                                              (int)numRows, (int)numCols,
                                                              d_filter, d_prand);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());

    recombineChannels<<<gridSize, blockSize>>>(d_redProcessed,
                                               d_greenProcessed,
                                               d_blueProcessed,
                                               d_outputImageRGBA,
                                               numRows,
                                               numCols);

    for (int i = 0; i < nstreams; i++)
    {
        cudaStreamDestroy(streams[i]);
    }
    free(streams);

    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
}
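
For reference, the separateChannels --> processOneChannel (x3) --> recombineChannels ordering can also be expressed with events instead of host-blocking cudaDeviceSynchronize() calls. The following is a minimal sketch under the same assumptions as the code above (same kernels, streams, and file-scope d_red/d_green/d_blue/d_filter pointers); it only moves the synchronization onto the device and does not by itself make the three kernels overlap:

    // Sketch only: enforce the stage ordering with events so the host never blocks.
    cudaEvent_t sepDone, chanDone[3];
    checkCudaErrors(cudaEventCreateWithFlags(&sepDone, cudaEventDisableTiming));

    separateChannels<<<gridSize, blockSize, 0, streams[0]>>>(d_inputImageRGBA,
                                                             (int)numRows, (int)numCols,
                                                             d_red, d_green, d_blue);
    checkCudaErrors(cudaEventRecord(sepDone, streams[0]));

    unsigned short *d_in[3]  = { d_red, d_green, d_blue };
    unsigned short *d_out[3] = { d_redProcessed, d_greenProcessed, d_blueProcessed };
    for (int i = 0; i < 3; i++)
    {
        // Each channel stream waits for separateChannels before its kernel may start.
        checkCudaErrors(cudaStreamWaitEvent(streams[i], sepDone, 0));
        processOneChannel<<<gridSize, blockSize, 0, streams[i]>>>(d_in[i], d_out[i],
                                                                  (int)numRows, (int)numCols,
                                                                  d_filter, d_prand);
        checkCudaErrors(cudaEventCreateWithFlags(&chanDone[i], cudaEventDisableTiming));
        checkCudaErrors(cudaEventRecord(chanDone[i], streams[i]));
        // recombineChannels (launched in streams[0]) must wait for all three channels.
        checkCudaErrors(cudaStreamWaitEvent(streams[0], chanDone[i], 0));
    }

    recombineChannels<<<gridSize, blockSize, 0, streams[0]>>>(d_redProcessed, d_greenProcessed,
                                                              d_blueProcessed, d_outputImageRGBA,
                                                              numRows, numCols);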

Here's the nvprof GPU trace output. Note that the memcpy operations before the kernel launches pass filter data for the processing, so they cannot run concurrently with the kernel launches.

==10001== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
1.02428s  2.2400us                    -               -         -         -         -  28.125MB   1e+04GB/s  GeForce GT 750M         1        13  [CUDA memset]
1.02855s  18.501ms                    -               -         -         -         -  28.125MB  1.4846GB/s  GeForce GT 750M         1        13  [CUDA memcpy HtoD]
1.21959s  1.1371ms                    -               -         -         -         -  1.7580MB  1.5098GB/s  GeForce GT 750M         1        13  [CUDA memcpy HtoD]
1.22083s  1.3440us                    -               -         -         -         -  7.0313MB   5e+03GB/s  GeForce GT 750M         1        13  [CUDA memset]
1.22164s  1.3440us                    -               -         -         -         -  7.0313MB   5e+03GB/s  GeForce GT 750M         1        13  [CUDA memset]
1.22243s  3.6480us                    -               -         -         -         -  7.0313MB   2e+03GB/s  GeForce GT 750M         1        13  [CUDA memset]
1.22349s  10.240us                    -               -         -         -         -  8.0000KB  762.94MB/s  GeForce GT 750M         1        13  [CUDA memcpy HtoD]
1.22351s  6.6021ms           (6 1441 1)       (512 1 1)        12        0B        0B         -           -  GeForce GT 750M         1        13  separateChannels(...) [123]
1.23019s  10.661ms           (6 1441 1)       (512 1 1)        36      192B        0B         -           -  GeForce GT 750M         1        14  processOneChannel(...) [133]
1.24085s  10.518ms           (6 1441 1)       (512 1 1)        36      192B        0B         -           -  GeForce GT 750M         1        15  processOneChannel(...) [141]
1.25137s  10.779ms           (6 1441 1)       (512 1 1)        36      192B        0B         -           -  GeForce GT 750M         1        16  processOneChannel(...) [149]
1.26372s  5.7810ms           (6 1441 1)       (512 1 1)        15        0B        0B         -           -  GeForce GT 750M         1        13  recombineChannels(...) [159]
1.26970s  19.859ms                    -               -         -         -         -  28.125MB  1.3831GB/s  GeForce GT 750M         1        13  [CUDA memcpy DtoH]

Here's the CMakeLists.txt where I passed -default-stream per-thread to nvcc:

cmake_minimum_required(VERSION 2.6 FATAL_ERROR)

find_package(OpenCV REQUIRED)
find_package(CUDA REQUIRED)

set(
    CUDA_NVCC_FLAGS
    ${CUDA_NVCC_FLAGS};
     -default-stream per-thread
)

file( GLOB  hdr *.hpp *.h )
file( GLOB  cu  *.cu)

SET (My_files main.cpp)

# Project Executable
CUDA_ADD_EXECUTABLE(My ${My_files} ${hdr} ${cu})
target_link_libraries(My ${OpenCV_LIBS})

Each kernel launch consists of 6*1441, which is over 8000 blocks, of 512 threads each. That fills the machine, preventing blocks from subsequent kernel launches from executing.

The machine has a finite capacity. The maximum instantaneous capacity in blocks is equal to the number of SMs in your GPU multiplied by the maximum number of blocks per SM, both of which are specifications that you can retrieve with the deviceQuery app. (For a GT 750M, with 2 Kepler SMs and at most 16 resident blocks per SM, that is only 32 blocks in flight at any instant.) When you fill it up, the machine cannot process more blocks until some of the already running blocks have retired. This process continues for the first kernel launch until most of its blocks have retired, and only then does the second kernel start executing.
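
For illustration, the two numbers can also be queried programmatically rather than via deviceQuery. A minimal sketch, assuming CUDA 11 or newer for the cudaDevAttrMaxBlocksPerMultiprocessor attribute (on older toolkits, take the per-SM block limit for your compute capability from the programming guide, e.g. 16 for cc 3.0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum number of resident blocks per multiprocessor (CUDA 11+ attribute).
    int maxBlocksPerSM = 0;
    cudaDeviceGetAttribute(&maxBlocksPerSM,
                           cudaDevAttrMaxBlocksPerMultiprocessor, 0);

    printf("%s: %d SMs x %d blocks/SM = %d blocks of instantaneous capacity\n",
           prop.name, prop.multiProcessorCount, maxBlocksPerSM,
           prop.multiProcessorCount * maxBlocksPerSM);
    return 0;
}

A per-kernel bound, which also accounts for register and shared-memory usage, can be obtained with cudaOccupancyMaxActiveBlocksPerMultiprocessor.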
