
Recording elapsed time of CUDA kernels with cudaEventRecord() for a multi-GPU program

I have a sparse triangular solver that works with 4 Tesla V100 GPUs. I have completed the implementation and everything works well in terms of accuracy. However, I am using a CPU timer to measure elapsed time. I know that a CPU timer is not the ideal choice for this, since I could use CUDA events instead.

The problem is that I do not know how to use CUDA events with multiple GPUs. In the NVIDIA tutorials I have seen, events are used for inter-GPU synchronization, i.e. waiting for other GPUs to finish their dependencies. Anyway, I declare the events like this:

cudaEvent_t start_events[num_gpus];
cudaEvent_t end_events[num_gpus];

I can also initialize these events in a loop by setting the current GPU iteratively.
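A minimal sketch of what that per-GPU creation loop might look like (assuming the num_gpus variable, the event arrays above, and the CUDA_FUNC_CALL error-checking macro used later in this post):

 for(int i = 0; i < num_gpus; i++)
 {
     // Make GPU i the current device so the events are created on it
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaEventCreate(&start_events[i]));
     CUDA_FUNC_CALL(cudaEventCreate(&end_events[i]));
 }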

And my kernel execution looks like this:

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     kernel<<<...>>>(...);   // launch the solver kernel on GPU i
 }

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaDeviceSynchronize());
 }

My question is: how should I use these events to record the elapsed time for each GPU separately?

You need to create two events per GPU, and record the events before and after the kernel call on each GPU.

It could look something like this:

cudaEvent_t start_events[num_gpus];
cudaEvent_t end_events[num_gpus];

for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaEventCreate(&start_events[i]));
     CUDA_FUNC_CALL(cudaEventCreate(&end_events[i]));
 }

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     // In cudaEventRecord, omit the stream argument or set it to 0 to record
     // in the default stream. The events must be recorded in the same stream
     // in which the kernel is launched (the default stream in this example).
     CUDA_FUNC_CALL(cudaEventRecord(start_events[i], 0));
     kernel<<<...>>>(...);
     CUDA_FUNC_CALL(cudaEventRecord(end_events[i], 0));
 }

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaDeviceSynchronize());
 }

 for(int i = 0; i < num_gpus; i++)
 {
     // The end event must have completed to get a valid duration.
     // In this example, this is guaranteed by the device synchronization above.
     float time_in_ms;
     CUDA_FUNC_CALL(cudaEventElapsedTime(&time_in_ms, start_events[i], end_events[i]));
     printf("Elapsed time on device %d: %f ms\n", i, time_in_ms);
 }

for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaEventDestroy(start_events[i]));
     CUDA_FUNC_CALL(cudaEventDestroy(end_events[i]));
 }
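
As a side note on the synchronization step: instead of calling cudaDeviceSynchronize() on every device, the host could wait only on the end events before reading the timings. A minimal sketch, assuming the same num_gpus, event arrays, and CUDA_FUNC_CALL macro as above:

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     // Block the host only until the end event of GPU i has completed,
     // rather than waiting for all outstanding work on the device.
     CUDA_FUNC_CALL(cudaEventSynchronize(end_events[i]));

     float time_in_ms;
     CUDA_FUNC_CALL(cudaEventElapsedTime(&time_in_ms, start_events[i], end_events[i]));
     printf("Elapsed time on device %d: %f ms\n", i, time_in_ms);
 }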
