Overlapping transfers and kernel execution in CUDA with two loops

Question

I want to overlap data transfers and kernel executions in a form like this:

int numStreams = 3;
int size = 10;

// Issue the input copies and the kernel for each iteration on a round-robin stream.
for (int i = 0; i < size; i++) {
    cuMemcpyHtoDAsync(_bufferIn1,
                      _host_memoryIn1,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);

    cuMemcpyHtoDAsync(_bufferIn2,
                      _host_memoryIn2,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);

    cuLaunchKernel(_kernel,
                   gs.x(), gs.y(), gs.z(),
                   bs.x(), bs.y(), bs.z(),
                   _memory_size,
                   cuda_stream[i % numStreams],
                   _kernel_arguments,
                   0);

    // Mark the end of this iteration's work on its stream.
    cuEventRecord(events[i], cuda_stream[i % numStreams]);
}

// Wait for each iteration's event, then copy the result back.
for (int i = 0; i < size; i++) {
    cuEventSynchronize(events[i]);

    cuMemcpyDtoHAsync(_host_memoryOut,
                      _bufferOut,
                      _size * sizeof(T),
                      cuda_stream[i % numStreams]);
}

Is overlapping possible in this case? Currently only the HtoD transfers overlap with the kernel executions. The first DtoH transfer is executed after the last kernel execution.

Answer 1

Overlap is only possible between operations issued to different streams. Operations within one stream execute sequentially, in the order the host issued them, so a device-to-host copy issued at the end runs only after every earlier operation on its stream has completed. The overlap does not happen here because the last kernel and the first DtoH copy are both on stream 0, so the copy has to wait for that kernel to finish. And because the second loop synchronizes on an event in every iteration, the copies for the other streams (streams 1 and 2) have not even been issued at that point.
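
To make the ordering argument concrete, here is a minimal host-only sketch in C++. It is a toy model, not CUDA API: the names `Op`, `Timeline`, `buildIssueOrder` and `simulate` are hypothetical, the durations are made-up time units, and it assumes one copy engine per direction, one kernel running at a time, and no host-side blocking (so the `cuEventSynchronize` calls are ignored). Each operation starts once both its stream and its engine are free.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model, not CUDA API: each op belongs to one stream and needs one engine.
enum Engine { H2D = 0, COMPUTE = 1, D2H = 2 };

struct Op {
    int stream;
    Engine engine;
    int duration;  // arbitrary time units (assumed values)
};

struct Timeline {
    int firstD2HStart = -1;  // start time of the first issued DtoH copy
    int lastKernelEnd = 0;   // end time of the latest kernel
    int makespan = 0;        // end time of the last operation overall
};

// Build the host issue order. With interleaved == false this mirrors the
// two-loop structure from the question; with interleaved == true each
// DtoH copy is issued right after its kernel, on the same stream.
std::vector<Op> buildIssueOrder(int size, int numStreams, bool interleaved) {
    const int copyTime = 1, kernelTime = 2;  // assumed durations
    std::vector<Op> ops;
    for (int i = 0; i < size; ++i) {
        const int s = i % numStreams;
        ops.push_back({s, H2D, copyTime});
        ops.push_back({s, H2D, copyTime});
        ops.push_back({s, COMPUTE, kernelTime});
        if (interleaved) ops.push_back({s, D2H, copyTime});
    }
    if (!interleaved) {
        for (int i = 0; i < size; ++i) {
            ops.push_back({i % numStreams, D2H, copyTime});
        }
    }
    return ops;
}

// Walk the ops in issue order. An op starts when its stream has finished
// all earlier ops (stream FIFO order) and its engine is free.
Timeline simulate(const std::vector<Op>& ops, int numStreams) {
    std::vector<int> streamReady(numStreams, 0);
    int engineFree[3] = {0, 0, 0};
    Timeline t;
    for (const Op& op : ops) {
        assert(op.stream >= 0 && op.stream < numStreams);
        const int start = std::max(streamReady[op.stream], engineFree[op.engine]);
        const int end = start + op.duration;
        streamReady[op.stream] = end;
        engineFree[op.engine] = end;
        if (op.engine == D2H && t.firstD2HStart < 0) t.firstD2HStart = start;
        if (op.engine == COMPUTE) t.lastKernelEnd = std::max(t.lastKernelEnd, end);
        t.makespan = std::max(t.makespan, end);
    }
    return t;
}
```

Calling `simulate(buildIssueOrder(10, 3, false), 3)` reproduces the observed behavior under these assumptions: iteration 9 maps to stream 0 (9 % 3 == 0), so the first DtoH copy, also on stream 0, cannot start before the last kernel ends. With `interleaved` set to `true`, the first DtoH copy starts right after the first kernel and the modeled total time is shorter. In the question's code this would correspond to issuing `cuMemcpyDtoHAsync` inside the first loop, right after `cuLaunchKernel` on the same stream. Whether real hardware overlaps as well depends on factors the model leaves out, such as page-locked host memory and the number of copy engines on the device.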

Answer 1
2 2020-04-16 22:34:42
