
Getting total execution time of all kernels on a CUDA stream

I know how to time the execution of one CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often made up of a series of kernels (CUB::DeviceRadixSort algorithms, for example, launch many kernels to get the job done). If you're running your algorithm on a system with a lot of other streams and kernels also in flight, it's not uncommon for the gaps between individual kernel launches to be highly variable, based on what other work gets scheduled in between launches on your stream. If I'm trying to make my algorithm work faster, I don't care so much about how long it spends sitting waiting for resources. I care about the time it spends actually executing.
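For reference, the single-kernel event timing mentioned above looks roughly like this (the kernel and its launch configuration are placeholders). Note that bracketing a whole series of kernels with one event pair this way measures end-to-end wall-clock time between the two markers, including any gaps where the stream is waiting:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel() { /* placeholder work */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                 // marker enqueued on the stream before the kernel
    myKernel<<<256, 256>>>();
    cudaEventRecord(stop);                  // marker enqueued after the kernel

    cudaEventSynchronize(stop);             // wait until the stop marker has been reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // wall-clock time between the two markers
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```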

So the question is: is there some way to do something like the event API and insert a marker in the stream before the first kernel launches, read it back after your last kernel launches, and have it tell you the actual amount of time spent executing on the stream, rather than the total end-to-end wall-clock time? Maybe something in CUPTI can do this?
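To illustrate the CUPTI idea raised here: the CUPTI Activity API reports per-kernel GPU start/end timestamps, which can be summed per stream so that gaps between launches are excluded. A rough sketch only, with error checking omitted; the kernel record struct version (CUpti_ActivityKernel4 here) and how the CUPTI stream id is obtained (e.g. via cuptiGetStreamId) depend on your CUPTI/toolkit version:

```cpp
#include <cupti.h>
#include <cstdio>
#include <cstdlib>

static uint64_t totalKernelNs  = 0;   // accumulated kernel execution time in ns
static uint32_t targetStreamId = 0;   // CUPTI stream id to filter on (assumed set elsewhere)

static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords) {
    *size = 8 * 1024 * 1024;                    // 8 MB activity buffer
    *buffer = (uint8_t *)malloc(*size);
    *maxNumRecords = 0;                         // no per-buffer record limit
}

static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size, size_t validSize) {
    CUpti_Activity *record = nullptr;
    // Walk every record CUPTI wrote into this buffer.
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL ||
            record->kind == CUPTI_ACTIVITY_KIND_KERNEL) {
            auto *k = (CUpti_ActivityKernel4 *)record;   // struct version varies by CUPTI release
            if (k->streamId == targetStreamId)
                totalKernelNs += (k->end - k->start);    // GPU-side timestamps; gaps excluded
        }
    }
    free(buffer);
}

void startKernelTiming() {
    // Collect per-kernel start/end timestamps without serializing concurrent kernels.
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
}

uint64_t stopKernelTiming() {
    // Force delivery of any buffered records, then report the accumulated total.
    cuptiActivityFlushAll(0);
    cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    return totalKernelNs;
}
```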

You can use Nsight Systems or Nsight Compute ( https://developer.nvidia.com/tools-overview ).

In Nsight Systems, you can profile the timeline of each stream. Also, you can use Nsight Compute to profile the details of each CUDA kernel. I guess Nsight Compute is better because you can inspect various metrics about GPU performance and get hints about kernel optimization.
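If you go the Nsight Systems route, one optional helper (not part of the answer above) is to bracket the algorithm with an NVTX range so the relevant span is easy to pick out on the timeline and the per-kernel durations under it can be summed. A minimal sketch, assuming the legacy nvToolsExt header; the header path and link flag (-lnvToolsExt) vary by toolkit version, and the kernels shown are placeholders:

```cpp
#include <cuda_runtime.h>
#include <nvToolsExt.h>   // legacy NVTX header; newer toolkits ship <nvtx3/nvToolsExt.h>

__global__ void stepA() { /* placeholder */ }
__global__ void stepB() { /* placeholder */ }

void runAlgorithm(cudaStream_t stream) {
    nvtxRangePushA("my-algorithm");        // named range shown on the Nsight Systems timeline
    stepA<<<128, 128, 0, stream>>>();      // stand-ins for the real series of kernels
    stepB<<<128, 128, 0, stream>>>();
    nvtxRangePop();
}
```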
