
CUDA C++ overlapping SERIAL kernel execution and data transfer

So this guide shows the general way to overlap kernel execution and data transfer:

cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) {
  cudaStreamCreate(&streams[i]);
  int offset = ...; // chunk start, e.g. i * streamSize
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
  kernel<<<streamSize/blockSize, blockSize, 0, streams[i]>>>(d_a, offset);
  // edit: no deviceToHost copy
}

However, my kernel is serial: it must process elements 0->1000, then 1000->2000, and so on. In short, the ordering required to perform this kernel correctly while overlapping data transfer is:

  • copy[a->b] must happen before kernel[a->b]
  • kernel[a->b] must happen before kernel[b->c], where c > a, b

Is it possible to do this without using cudaDeviceSynchronize()? If not, what's the fastest way to do it?

So each kernel is dependent on (cannot begin until):

  1. The associated H->D copy is complete
  2. The previous kernel execution is complete

Ordinary stream semantics won't handle this case (2 separate dependencies, from 2 separate streams), so we'll need to put an extra interlock in there. We can use a set of events and cudaStreamWaitEvent() to handle it.
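Before the full example, here is the interlock primitive in isolation (a minimal standalone sketch; the k1/k2 kernels and the stream/event names are just placeholders, not from the question):

#include <cuda_runtime.h>

__global__ void k1(int *d) { d[0] = 1; }
__global__ void k2(int *d) { d[1] = d[0] + 1; } // must observe k1's result

int main(){
  int *d;
  cudaMalloc(&d, 2*sizeof(int));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1); cudaStreamCreate(&s2);
  cudaEvent_t e;
  cudaEventCreate(&e);
  k1<<<1,1,0,s1>>>(d);            // work issued to stream s1
  cudaEventRecord(e, s1);         // completion marker recorded into s1
  cudaStreamWaitEvent(s2, e, 0);  // s2 stalls here until that marker completes...
  k2<<<1,1,0,s2>>>(d);            // ...so k2 cannot begin until k1 has finished
  cudaDeviceSynchronize();
  return 0;
}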

For the most general case (no knowledge of the total number of chunks) I would recommend something like this:

$ cat t1783.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__global__ void process(const T * __restrict__ in, const T * __restrict__ prev, T * __restrict__ out, size_t ds){
  // grid-stride loop: each output chunk is the input chunk plus the previous output chunk
  for (size_t i = threadIdx.x+blockDim.x*blockIdx.x; i < ds; i += gridDim.x*blockDim.x){
    out[i] = in[i] + prev[i];
  }
}
const int nTPB = 256;
typedef int mt;
const int chunk_size = 1048576;
const int data_size = 10*1048576;
const int ns = 3;

int main(){

  mt *din, *dout, *hin, *hout;
  cudaStream_t str[ns];
  cudaEvent_t  evt[ns];
  for (int i = 0; i < ns; i++) {
    cudaStreamCreate(str+i);
    cudaEventCreate( evt+i);}
  cudaMalloc(&din, sizeof(mt)*data_size);
  cudaMalloc(&dout, sizeof(mt)*data_size);
  cudaHostAlloc(&hin,  sizeof(mt)*data_size, cudaHostAllocDefault);
  cudaHostAlloc(&hout, sizeof(mt)*data_size, cudaHostAllocDefault);
  cudaMemset(dout, 0, sizeof(mt)*chunk_size);  // for first loop iteration
  for (int i = 0; i < data_size; i++) hin[i] = 1;
  cudaEventRecord(evt[ns-1], str[ns-1]); // this event will immediately "complete"
  unsigned long long dt = dtime_usec(0);
  for (int i = 0; i < (data_size/chunk_size); i++){
    cudaStreamSynchronize(str[i%ns]); // so we can reuse event safely
    cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]); // H->D copy of chunk i
    cudaStreamWaitEvent(str[i%ns], evt[(i>0)?(i-1)%ns:ns-1], 0); // interlock: wait for the previous chunk's kernel
    process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
    cudaEventRecord(evt[i%ns], str[i%ns]); // record in the work stream (omitting the stream argument would record in the legacy default stream)
    cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]); // D->H copy of chunk i
    }
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  for (int i = 0; i < data_size; i++) if (hout[i] != (i/chunk_size)+1) {std::cout << "error at index: " << i << " was: " << hout[i] << " should be: " << (i/chunk_size)+1 << std::endl; return 0;}
  std::cout << "elapsed time: " << dt << " microseconds" << std::endl;
}
$ nvcc -o t1783 t1783.cu
$ ./t1783
elapsed time: 4366 microseconds

Good practice here would be to use a profiler to verify the expected overlap scenarios. However, we can take a shortcut based on the elapsed time measurement.
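For the profiler route, either of the profilers that ship with the CUDA toolkit can capture the timeline (typical command forms; the report name is arbitrary):

$ nvprof ./t1783
$ nsys profile -o t1783_report ./t1783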

The loop is transferring a total of 40MB of data to the device, and 40MB back. The elapsed time is 4366us. This gives an average throughput in each direction of (40*1048576)/4366 = 9606 bytes/us, i.e. about 9.6GB/s. That is basically saturating a PCIe Gen3 link in both directions, therefore my chunk processing is approximately back-to-back, and I have essentially full overlap of the D->H and H->D memcopies. The kernel here is trivial, so it shows up as just slivers in the profile.
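If you want to double-check that arithmetic, a trivial sketch using the numbers from the run above:

#include <cstdio>
int main(){
  double bytes_per_dir = 40.0 * 1048576; // 40MB moved in each direction
  double usec = 4366.0;                  // measured elapsed time
  printf("%.1f GB/s per direction\n", bytes_per_dir / usec * 1e6 / 1e9); // ~9.6
  return 0;
}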

For your case, you indicated you didn't need the D->H copy, but it adds no extra complexity, so I chose to show it. The desired behavior still occurs if you comment that line out of the loop (although doing so breaks the results check at the end).

A possible criticism of this approach is that the cudaStreamSynchronize() call, which is necessary so we don't "overrun" the event interlock, means that the loop can only run ns iterations ahead of the work currently executing on the device. So it is not possible to launch more work asynchronously than that. If you wanted to launch all the work at once and then go do something else on the CPU, this method will not fully allow that (the CPU will only proceed past the loop once the stream processing is within ns iterations of the last one).
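If you do know the total number of chunks up front, one way around this (my own variant, not part of t1783.cu above) is to create one event per chunk rather than one per stream. No event is ever reused, so the per-iteration cudaStreamSynchronize() call can be dropped and the entire loop becomes asynchronous:

const int nChunks = data_size/chunk_size;
cudaEvent_t evtc[nChunks];                // hypothetical variant: one event per chunk, never reused
for (int i = 0; i < nChunks; i++) cudaEventCreate(evtc+i);
for (int i = 0; i < nChunks; i++){
  cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]);
  if (i > 0) cudaStreamWaitEvent(str[i%ns], evtc[i-1], 0); // first chunk has no predecessor
  process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
  cudaEventRecord(evtc[i], str[i%ns]);
  cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]);
  }
// the host thread falls through immediately; synchronize later, whenever convenient

The trade-off is nChunks event objects instead of ns, which is usually cheap.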

The code is presented to illustrate an approach, conceptually. It is not guaranteed to be defect free, nor do I claim it is suitable for any particular purpose.
