CUDA C++ overlapping SERIAL kernel execution and data transfer

So this guide here shows the general way to overlap kernel execution and data transfer.

cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) {
  cudaStreamCreate(&streams[i]);
  int offset = ...;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
  kernel<<<streamSize/blockSize, blockSize, 0, streams[i]>>>(d_a, offset);
  // edit: no deviceToHost copy
}

However, the kernel is serial. So it must process 0->1000, then 1000->2000, ... In short, the order to correctly perform this kernel while overlapping data transfer is:

  • copy[a->b] must happen before kernel[a->b]
  • kernel[a->b] must happen before kernel[b->c], where c > a, b

Is it possible to do this without using cudaDeviceSynchronize()? If not, what's the fastest way to do it?

So each kernel is dependent on (cannot begin until):

  1. The associated H->D copy is complete
  2. The previous kernel execution is complete

Ordinary stream semantics won't handle this case (2 separate dependencies, from 2 separate streams), so we'll need to put an extra interlock in there. We can use a set of events and cudaStreamWaitEvent() to handle it.
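The core mechanism, stripped of everything else, looks like this (a minimal sketch with placeholder kernel and variable names, not taken from the full program below):

__global__ void kernelA(int *d){ d[0] = 1; }
__global__ void kernelB(int *d){ d[0] += 1; }

int *d; cudaMalloc(&d, sizeof(int));
cudaStream_t sA, sB;
cudaStreamCreate(&sA); cudaStreamCreate(&sB);
cudaEvent_t evt; cudaEventCreate(&evt);
kernelA<<<1, 1, 0, sA>>>(d);      // predecessor work, issued to stream sA
cudaEventRecord(evt, sA);         // evt completes when kernelA completes
cudaStreamWaitEvent(sB, evt, 0);  // all later work in sB waits for evt
kernelB<<<1, 1, 0, sB>>>(d);      // runs after kernelA, despite being in a separate stream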

For the most general case (no knowledge of the total number of chunks) I would recommend something like this:

$ cat t1783.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// timing helper: elapsed microseconds since start (pass 0 to get a timestamp)
unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// out[i] = in[i] + prev[i], grid-stride loop over a chunk of ds elements
template <typename T>
__global__ void process(const T * __restrict__ in, const T * __restrict__ prev, T * __restrict__ out, size_t ds){

  for (size_t i = threadIdx.x+blockDim.x*blockIdx.x; i < ds; i += gridDim.x*blockDim.x){
    out[i] = in[i] + prev[i];
    }
}
const int nTPB = 256;              // threads per block
typedef int mt;                    // element type used throughout
const int chunk_size = 1048576;    // elements per chunk
const int data_size = 10*1048576;  // total elements (10 chunks)
const int ns = 3;                  // number of streams (and reusable events)

int main(){

  mt *din, *dout, *hin, *hout;
  cudaStream_t str[ns];
  cudaEvent_t  evt[ns];
  for (int i = 0; i < ns; i++) {
    cudaStreamCreate(str+i);
    cudaEventCreate( evt+i);}
  cudaMalloc(&din, sizeof(mt)*data_size);
  cudaMalloc(&dout, sizeof(mt)*data_size);
  cudaHostAlloc(&hin,  sizeof(mt)*data_size, cudaHostAllocDefault); // pinned host memory,
  cudaHostAlloc(&hout, sizeof(mt)*data_size, cudaHostAllocDefault); // required for async overlap
  cudaMemset(dout, 0, sizeof(mt)*chunk_size);  // for first loop iteration
  for (int i = 0; i < data_size; i++) hin[i] = 1;
  cudaEventRecord(evt[ns-1], str[ns-1]); // this event will immediately "complete"
  unsigned long long dt = dtime_usec(0);
  for (int i = 0; i < (data_size/chunk_size); i++){
    cudaStreamSynchronize(str[i%ns]); // so we can reuse event safely
    cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]);
    cudaStreamWaitEvent(str[i%ns], evt[(i>0)?(i-1)%ns:ns-1], 0); // wait for the previous chunk's kernel
    process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
    cudaEventRecord(evt[i%ns], str[i%ns]); // record in this stream so the next chunk's kernel can wait on it
    cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]);
    }
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  for (int i = 0; i < data_size; i++)
    if (hout[i] != (i/chunk_size)+1) {
      std::cout << "error at index: " << i << " was: " << hout[i] << " should be: " << (i/chunk_size)+1 << std::endl;
      return 0;}
  std::cout << "elapsed time: " << dt << " microseconds" << std::endl;
}
$ nvcc -o t1783 t1783.cu
$ ./t1783
elapsed time: 4366 microseconds

Good practice here would be to use a profiler to verify the expected overlap scenarios. However, we can take a shortcut based on the elapsed time measurement.

The loop is transferring a total of 40MB of data to the device, and 40MB back. The elapsed time is 4366us. This gives an average throughput for each direction of (40*1048576)/4366 or 9606 bytes/us, which is 9.6GB/s. This is basically saturating the Gen3 link in both directions, therefore my chunk processing is approximately back-to-back, and I have essentially full overlap of D->H with H->D memcopies. The kernel here is trivial so it shows up as just slivers in the profile.
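If you do want to confirm it in a profiler, a timeline view is enough to see the H->D copies, kernels, and D->H copies overlapping across the 3 streams. Assuming Nsight Systems (nsys) is installed, an invocation like this is one way to get it (flags vary by toolkit version):

$ nsys profile --stats=true ./t1783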

For your case, you indicated you didn't need the D->H copy, but it adds no extra complexity so I chose to show it. The desired behavior still occurs if you comment that line out of the loop (although this affects the results checking later).

A possible criticism of this approach is that the cudaStreamSynchronize() call, which is necessary so we don't "overrun" the event interlock, means that the loop will only proceed ns iterations beyond the one that is currently executing on the device. So it is not possible to launch more work asynchronously than that. If you wanted to launch all the work at once and then go on and do something else on the CPU, this method will not fully allow that (the CPU will proceed past the loop only when the stream processing is within ns iterations of the last one).
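If that matters for your case, one way around it is to give every chunk its own event instead of reusing a pool of ns events; then no event is ever re-recorded and the cudaStreamSynchronize() call can be dropped. A sketch of the modified loop (untested, reusing the variables from the program above; evts is a new per-chunk array):

  const int nChunks = data_size/chunk_size;
  cudaEvent_t evts[nChunks];                 // one event per chunk, never reused
  for (int i = 0; i < nChunks; i++) cudaEventCreate(evts+i);
  for (int i = 0; i < nChunks; i++){
    cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]);
    if (i > 0) cudaStreamWaitEvent(str[i%ns], evts[i-1], 0); // wait on the previous chunk's kernel
    process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
    cudaEventRecord(evts[i], str[i%ns]);     // recorded exactly once, so no reuse hazard
    cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]);
    }
  // the CPU falls through immediately; all nChunks of work are queued asynchronously

The cost is nChunks events instead of ns, which is usually negligible.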

The code is presented to illustrate an approach, conceptually. It is not guaranteed to be defect-free, nor do I claim it is suitable for any particular purpose.
