
Concurrency of one large kernel with many small kernels and memcopies (CUDA)

I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding: while data is exchanged, the GPU computes the part of the mesh that is not involved in the communication, and computes the rest of the mesh once the communication is done.

I am trying to solve this by using one stream (computeStream) for the long-running kernel (fluxKernel) and one (communicationStream) for the different phases of communication. The computeStream has a very low priority, in order to allow kernels on the communicationStream to interleave with the fluxKernel, even though it uses all resources.

These are the streams I am using:

int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low , &priority_high ) ;
cudaStreamCreateWithPriority (&communicationStream, cudaStreamNonBlocking, priority_high );
cudaStreamCreateWithPriority (&computeStream      , cudaStreamNonBlocking, priority_low  );

The desired concurrency pattern looks like this:

[Figure: desired concurrency pattern]

I need to synchronize the communicationStream before I send the data via MPI, to ensure that the data is completely downloaded before I send it on.

In the following listing I show the structure of what I am currently doing. First I start the long-running fluxKernel for the main part of the mesh on the computeStream. Then I start a sendKernel that collects the data to be sent to the second GPU, and subsequently download that data to the host (I cannot use CUDA-aware MPI due to hardware limitations). The data is then sent non-blocking with MPI_Isend, and a blocking receive (MPI_Recv) is used subsequently. When the data is received, the procedure is reversed: first the data is uploaded to the device and then spread into the main data structure by recvKernel. Finally the fluxKernel is called for the remaining part of the mesh on the communicationStream.

Note that kernels are run on the default stream before and after the code shown.

{ ... } // Preparations

// Start main part of computatation on first stream

fluxKernel<<< ..., ..., 0, computeStream >>>( /* main Part */ );

// Prepare send data

sendKernel<<< ..., ..., 0, communicationStream >>>( ... );

cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );

// MPI Communication

MPI_Isend( ... );
MPI_Recv ( ... );

// Use received data

cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );

recvKernel<<< ..., ..., 0, communicationStream >>>( ... );

fluxKernel<<< ..., ..., 0, communicationStream >>>( /* remaining Part */ );

{ ... } // Rest of the Computations

I used nvprof and the Visual Profiler to see whether the streams actually execute concurrently. This is the result:

[Figure: profiler timeline with synchronization]

I observe that the sendKernel (purple), the upload, the MPI communication and the download are concurrent with the fluxKernel. The recvKernel (red), however, only starts after the other stream has finished. Turning off the synchronization does not solve the problem:

[Figure: profiler timeline without synchronization]

In my real application I have not just one communication, but several. I tested this with two communications as well. The procedure is:

sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );

sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );

MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );

MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );

The result is similar to the one with one communication (above), in the sense that the second kernel invocation (this time it is a sendKernel) is delayed until the kernel on the computeStream has finished.

[Figure: profiler timeline with two communications]

Hence the overall observation is that the second kernel invocation is delayed, independent of which kernel it is.

Can you explain why the GPU is synchronizing in this way, or how I can get the second kernel on the communicationStream to also run concurrently with the computeStream?

Thank you very much.

Edit 1: complete rework of the question


Minimal Reproducible Example

I built a minimal reproducible example. At the end, the code prints the int data to the terminal. The correct last value would be 32778 (= (32*1024 - 1) + 1 + 10). At the beginning I added an option integer to test three different options:

  • 0: Intended version with synchronization before the CPU modification of the data
  • 1: Same as 0, but without synchronization
  • 2: Dedicated stream for the memcpys and no synchronization

#include <iostream>

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

const int option = 0;

const int numberOfEntities = 2 * 1024 * 1024;
const int smallNumberOfEntities = 32 * 1024;

__global__ void longKernel(float* dataDeviceIn, float* dataDeviceOut, int numberOfEntities)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;

    float tmp = dataDeviceIn[index];

#pragma unroll
    for( int i = 0; i < 2000; i++ ) tmp += 1.0;

    dataDeviceOut[index] = tmp;
}

__global__ void smallKernel_1( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;

    smallDeviceData[index] = index;
}

__global__ void smallKernel_2( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;

    int value = smallDeviceData[index];

    value += 10;

    smallDeviceData[index] = value;
}


int main(int argc, char **argv)
{
    cudaSetDevice(0);

    float* dataDeviceIn;
    float* dataDeviceOut;

    cudaMalloc( &dataDeviceIn , sizeof(float) * numberOfEntities );
    cudaMalloc( &dataDeviceOut, sizeof(float) * numberOfEntities );

    int* smallDataDevice;
    int* smallDataHost;

    cudaMalloc    ( &smallDataDevice, sizeof(int) * smallNumberOfEntities );
    cudaMallocHost( &smallDataHost  , sizeof(int) * smallNumberOfEntities );

    cudaStream_t streamLong;
    cudaStream_t streamSmall;
    cudaStream_t streamCopy;

    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low , &priority_high ) ;
    cudaStreamCreateWithPriority (&streamLong , cudaStreamNonBlocking, priority_low  );
    cudaStreamCreateWithPriority (&streamSmall, cudaStreamNonBlocking, priority_high );
    cudaStreamCreateWithPriority (&streamCopy , cudaStreamNonBlocking, priority_high );

    //////////////////////////////////////////////////////////////////////////

    longKernel <<< numberOfEntities / 32, 32, 0, streamLong >>> (dataDeviceIn, dataDeviceOut, numberOfEntities);

    //////////////////////////////////////////////////////////////////////////

    smallKernel_1 <<< smallNumberOfEntities / 32, 32, 0 , streamSmall >>> (smallDataDevice, smallNumberOfEntities);

    if( option <= 1 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamCopy  );

    if( option == 0 ) cudaStreamSynchronize( streamSmall );

    // some CPU modification of data
    for( int i = 0; i < smallNumberOfEntities; i++ ) smallDataHost[i] += 1;

    if( option <= 1 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamCopy  );

    smallKernel_2 <<< smallNumberOfEntities / 32, 32, 0 , streamSmall >>> (smallDataDevice, smallNumberOfEntities);

    //////////////////////////////////////////////////////////////////////////

    cudaDeviceSynchronize();

    cudaMemcpy( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost );

    for( int i = 0; i < smallNumberOfEntities; i++ ) std::cout << smallDataHost[i] << "\n";

    return 0;
}

With this code I see the same behavior as described above:

Option 0 (correct result): [Figure: option 0 result]

Option 1 (wrong result, the +1 from the CPU is missing): [Figure: option 1 result]

Option 2 (completely wrong result, all values are 10, the download happens before smallKernel_1): [Figure: option 2 result]


Solutions:

Running option 0 under Linux (following the suggestion in Robert's answer) brings the expected behavior! [Figure: option 0 result under Linux]

Here's how I would try to accomplish this.

  1. Use a high-priority/low-priority stream arrangement as you suggest.
  2. Only 2 streams should be needed.
  3. Make sure to pin host memory to allow compute/copy overlap.
  4. Since you don't intend to use cuda-aware MPI, your MPI transactions are purely host activity. Therefore we can use a stream callback to insert this host activity into the high-priority stream.
  5. To allow the high-priority kernels to easily insert themselves between the blocks of the low-priority kernel, I chose a grid-stride-loop design for the high-priority copy kernels, but a non-grid-stride-loop design for the low-priority kernels. We want the low-priority kernels to have a larger number of blocks, so that blocks are launching and retiring all the time, easily allowing the GPU block scheduler to insert high-priority blocks as they become available.
  6. The work issuance per "frame" uses no synchronize calls of any kind. I am using a cudaDeviceSynchronize() once per loop/frame, to break (separate) the processing of one frame from the next. Arrangement of activities within a frame is handled entirely with CUDA stream semantics, to enforce serialization for activities which depend on each other, but to allow concurrency for activities that don't.

Here's a sample code that implements these ideas:

#include <iostream>
#include <unistd.h>
#include <cstdio>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

typedef double mt;
const int nTPB = 512;
const size_t ds = 100ULL*1048576;
const size_t bs = 1048576ULL;
const int  my_intensity = 1;
const int loops = 4;
const size_t host_func_delay_us = 100;
const int max_blocks = 320; // chosen based on GPU, could use runtime calls to set this via cudaGetDeviceProperties

template <typename T>
__global__ void fluxKernel(T * __restrict__ d, const size_t n, const int intensity){

  size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x;
  if (idx < n){
    T temp = d[idx];
    for (int i = 0; i < intensity; i++)
      temp = sin(temp);  // just some dummy code to simulate "real work"
    d[idx] = temp;
    }
}

template <typename T>
__global__ void sendKernel(const T * __restrict__ d, const size_t n, T * __restrict__ b){

  for (size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x; idx < n; idx += ((size_t)blockDim.x)*gridDim.x)
    b[idx] = d[idx];
}

template <typename T>
__global__ void recvKernel(const T * __restrict__ b, const size_t n, T * __restrict__ d){

  for (size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x; idx < n; idx += ((size_t)blockDim.x)*gridDim.x)
    d[idx] = b[idx];
}

void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *data){
    printf("Loop %lu callback\n", (size_t)data);
    usleep(host_func_delay_us); // simulate: this is where non-cuda-aware MPI calls would go, operating on h_buf
}
int main(){

  // get the range of stream priorities for this device
  int priority_high, priority_low;
  cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
  // create streams with highest and lowest available priorities
  cudaStream_t st_high, st_low;
  cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
  cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);
  // allocations
  mt *h_buf, *d_buf, *d_data;
  cudaMalloc(&d_data, ds*sizeof(d_data[0]));
  cudaMalloc(&d_buf, bs*sizeof(d_buf[0]));
  cudaHostAlloc(&h_buf, bs*sizeof(h_buf[0]), cudaHostAllocDefault);
  cudaCheckErrors("setup error");
  // main processing loop
  for (unsigned long i = 0; i < loops; i++){
    // issue low-priority
    fluxKernel<<<((ds-bs)+nTPB)/nTPB, nTPB,0,st_low>>>(d_data+bs, ds-bs, my_intensity);
    // issue high-priority
    sendKernel<<<max_blocks,nTPB,0,st_high>>>(d_data, bs, d_buf);
    cudaMemcpyAsync(h_buf, d_buf, bs*sizeof(h_buf[0]), cudaMemcpyDeviceToHost, st_high);
    cudaStreamAddCallback(st_high, MyCallback, (void*)i, 0);
    cudaMemcpyAsync(d_buf, h_buf, bs*sizeof(h_buf[0]), cudaMemcpyHostToDevice, st_high);
    recvKernel<<<max_blocks,nTPB,0,st_high>>>(d_buf, bs, d_data);
    fluxKernel<<<((bs)+nTPB)/nTPB, nTPB,0,st_high>>>(d_data, bs, my_intensity);
    cudaDeviceSynchronize();
    cudaCheckErrors("loop error");
    }
  return 0;
}

Here is the Visual Profiler timeline output (on Linux, Tesla V100):

[Figure: Visual Profiler timeline]

Note that arranging complex concurrency scenarios can be quite challenging on Windows WDDM. I would recommend avoiding that, and this answer does not intend to discuss all the challenges there. I suggest using Linux or Windows TCC GPUs to do this.
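
One way to check which driver model a GPU is running under is the tccDriver field reported by cudaGetDeviceProperties. A minimal standalone sketch (separate from the sample above):

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // tccDriver is 1 for a Windows GPU running the TCC driver, 0 under WDDM
    // (and also 0 on Linux, where the distinction does not apply)
    printf("device 0: %s, tccDriver = %d\n", prop.name, prop.tccDriver);
    return 0;
}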

If you try the sample code above on your machine, you may need to adjust some of the various constants to get things to look like this.
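
For example, the hard-coded max_blocks = 320 could be derived at run time instead, as the comment next to its definition suggests. A minimal sketch of one possible heuristic (one full wave of resident blocks for the grid-stride copy kernels), assuming the sendKernel, mt, and nTPB definitions from the sample above:

int dev = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, dev);

// how many blocks of the copy kernel can be resident per SM at nTPB threads per block
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, sendKernel<mt>, nTPB, 0);

// one "wave" of resident blocks across the whole device,
// usable as a run-time replacement for the hard-coded max_blocks
int max_blocks_runtime = blocksPerSM * prop.multiProcessorCount;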
