Tracing custom CUDA kernels with Nsight Systems

Question

I work on a library which is implemented in C++20 and CUDA 11. This library is called from Python via ctypes through a C API that just exchanges JSON strings. We compile it using Clang 11.

In order to profile the code I have added a lot of NVTX ranges to the C++ code. This works well with Nsight Systems: I can see the stack of ranges with their manually chosen names when I use nsys profile -t nvtx … to gather data. This doesn't tell me anything about the GPU, though, so I specify nvtx,cuda,cublas,cudnn in order to get more information.

But all I get is one of the many kernels. The output looks like this: [screenshot of the Nsight Systems timeline]

One can see the nice NVTX contexts and the calls to the CUDA API (memcpy and the like). But only one kernel shows up; I have marked it with a red arrow.

We have a bunch of different kernels and launch them with the <<<>>> syntax right from the .cu files.

It feels like I am missing either a tracing flag for nsys, some compilation option for the CUDA code, or some code annotations like NVTX for the kernel code. What do I have to do so that my custom kernels show up in the profile?

Answer 1

The issue could have been that I had not properly stopped the data gathering. Our program is an interactive server which one stops with a SIGINT, so perhaps the data was not properly stored after the interrupt.

I have added calls to the profiler API in the code so that cudaProfilerStop() is called explicitly after our main loop is done. I've done it with a small RAII wrapper so that it works even with SIGINT.

#include <cuda_profiler_api.h>

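// RAII guard: starts the capture on construction and stops it on destruction,
// so the capture ends whenever the guard goes out of scope.
class ProfilingRange {
 public:
  ProfilingRange() {
    cudaProfilerStart();
  }

  ~ProfilingRange() {
    cudaProfilerStop();
  }

  // Non-copyable, so start and stop stay paired one-to-one.
  ProfilingRange(const ProfilingRange&) = delete;
  ProfilingRange& operator=(const ProfilingRange&) = delete;
};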

On the nsys profile command line I specify --capture-range=cudaProfilerApi and it seems to work fine. Now a lot of kernels show up, and I can learn a lot more about the system.
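A note on the SIGINT part, since the code for it is not shown above: a destructor only runs if the process leaves the scope normally, and the default SIGINT action terminates the process without unwinding. One common way to get there, sketched below as an assumption rather than the actual server code, is a handler that only sets a flag which the main loop checks. The names g_stop_requested, handle_sigint, ScopedRange and run_server are hypothetical, and ScopedRange takes its start/stop actions as callbacks so the sketch compiles without the CUDA toolkit.

```cpp
#include <csignal>
#include <functional>
#include <utility>

// Written by the signal handler, read by the main loop.
// volatile std::sig_atomic_t is the type the standard allows a handler to write.
volatile std::sig_atomic_t g_stop_requested = 0;

// The handler does nothing but record the request; all cleanup happens
// later in normal control flow.
extern "C" void handle_sigint(int) { g_stop_requested = 1; }

// Hypothetical stand-in for ProfilingRange. In the real program the two
// callbacks would be cudaProfilerStart and cudaProfilerStop.
class ScopedRange {
 public:
  ScopedRange(const std::function<void()>& start, std::function<void()> stop)
      : stop_(std::move(stop)) {
    start();
  }
  ~ScopedRange() { stop_(); }

  ScopedRange(const ScopedRange&) = delete;
  ScopedRange& operator=(const ScopedRange&) = delete;

 private:
  std::function<void()> stop_;
};

// Hypothetical main loop of the server. Returns the number of requests served.
int run_server(const std::function<void()>& start,
               const std::function<void()>& stop,
               const std::function<void()>& serve_one_request) {
  g_stop_requested = 0;
  std::signal(SIGINT, handle_sigint);

  ScopedRange range(start, stop);
  int served = 0;
  while (!g_stop_requested) {
    serve_one_request();
    ++served;
  }
  return served;  // `range` is destroyed here, so stop() runs before returning
}
```

With this shape, Ctrl-C lets the current request finish, the loop exits, and the guard's destructor stops the capture before the process ends.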

Answer 1: accepted, 2021-04-21 10:40:55
