
Is NVIDIA's JIT compilation cache used when you don't use NVCC?

Original post: 2022-05-16 13:40:23 · Tags: cuda, jit, nvcc, cuda-driver, cuda-jit-cache

As we should all know (but not enough people do), when you build a CUDA program with NVCC and run it on a device for which the binary contains no fully compiled (SASS) code, the intermediate PTX code is JIT-compiled, and the result is what actually runs your kernels. During this JIT compilation, a JIT compilation cache kicks in, so that the next time you run the same executable, compilation can be skipped in favor of just loading the cached result.

Now, suppose I'm writing a C++ program which compiles a kernel dynamically, at run time, rather than using NVCC, e.g.:

- I use NVRTC's nvrtcCompileProgram() to compile CUDA C++ code, targeting a concrete architecture (e.g. sm_70).
- I use the CUDA driver's cuModuleLoad() to load a PTX file with the kernel.

Will the compilation result be placed in that cache?

2 answers

The caching behaviour you are describing has nothing to do with either nvcc or nvrtc. The caching of runtime JIT-compiled code is a driver-level mechanism, provided primarily so that newer hardware can run older code.

There are exactly three cases to consider when running a kernel through either the runtime or the driver API:

1. The application provides compatible SASS to the driver (be that a statically linked payload in a runtime API application, SASS loaded from a file, or SASS emitted by nvrtc with a physical architecture as the target). In this case the SASS is loaded and executed. No caching is involved.
2. The application provides valid PTX code (be that from a fatbinary payload when no compatible SASS is present, or loaded via the driver API, whatever the source of that payload, which includes nvrtc when a virtual architecture is the target). In this case the driver triggers JIT compilation of the PTX and loads the resulting SASS for execution. This is where caching occurs. The driver checks the user-specific private cache of JIT output, if it exists; if it finds a match for PTX it has previously compiled, it retrieves the SASS from the cache and uses it rather than compiling the same PTX again. This mechanism can be defeated by setting CUDA_CACHE_DISABLE to 1. A fuller discussion of this mechanism and its controls can be found here. If the PTX is invalid, an invalid (or incompatible) PTX error is returned to the caller and execution fails.
3. The application provides neither compatible SASS nor PTX. In this case a "no binary for GPU" error (or its runtime API equivalent) is returned to the caller and execution fails. The driver PTX cache plays no role in this case.
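
The three cases can be restated as a small decision function. This is only a sketch of the logic described in this answer, not driver code: `Payload`, `Outcome`, `CacheState`, `resolve` and `check_question_scenarios` are hypothetical names, and the `JitWithoutCache` branch assumes that setting CUDA_CACHE_DISABLE to 1 leaves JIT compilation itself in place and only bypasses the cache.

```cpp
#include <cassert>

// All names here are hypothetical; none of them belong to the CUDA driver API.
enum class Payload { CompatibleSass, ValidPtx, InvalidPtx, Nothing };

enum class Outcome {
    LoadSass,         // case 1: SASS is loaded as-is, cache not consulted
    ReuseCachedSass,  // case 2: matching PTX was compiled before (cache hit)
    JitAndCache,      // case 2: cache miss, JIT-compile and store the result
    JitWithoutCache,  // case 2 with the cache disabled (assumed behaviour)
    InvalidPtxError,  // case 2: PTX rejected as invalid or incompatible
    NoBinaryForGpu    // case 3: neither usable SASS nor PTX
};

struct CacheState {
    bool enabled;    // false models CUDA_CACHE_DISABLE=1
    bool has_match;  // an entry for this exact PTX already exists
};

inline Outcome resolve(Payload payload, CacheState cache) {
    switch (payload) {
        case Payload::CompatibleSass: return Outcome::LoadSass;
        case Payload::InvalidPtx:     return Outcome::InvalidPtxError;
        case Payload::Nothing:        return Outcome::NoBinaryForGpu;
        case Payload::ValidPtx:       break;  // handled below
    }
    if (!cache.enabled) return Outcome::JitWithoutCache;
    return cache.has_match ? Outcome::ReuseCachedSass : Outcome::JitAndCache;
}

// The two scenarios from the question, expressed in terms of the model.
inline void check_question_scenarios() {
    // NVRTC targeting sm_70 hands the driver SASS: case 1.
    assert(resolve(Payload::CompatibleSass, CacheState{true, false}) == Outcome::LoadSass);
    // cuModuleLoad() on a PTX file, first run then second run: case 2.
    assert(resolve(Payload::ValidPtx, CacheState{true, false}) == Outcome::JitAndCache);
    assert(resolve(Payload::ValidPtx, CacheState{true, true}) == Outcome::ReuseCachedSass);
}
```

Keeping the cache state as an explicit argument makes it visible that only the PTX path ever looks at it.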

So, to your two scenarios:

> I use NVRTC's nvrtcCompileProgram() to compile CUDA C++ code, targeting a concrete architecture (e.g. sm_70).

In this scenario, you fall into the first or third case above. The binary payload will be loaded and executed if valid, or fail with an error if invalid. No caching occurs.

> I use the CUDA driver's cuModuleLoad() to load a PTX file with the kernel.

In this scenario, case 2 applies. The driver does a cache check and either reuses the output of a previous JIT pass from the cache or, on a cache miss, attempts a JIT compile and caches the result. If the PTX is valid and compatible, the kernel runs.
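
One way to check this on your own machine is to time the module load with and without the cache, using the CUDA_CACHE_DISABLE control mentioned under case 2. The sketch below only sets and reads the environment variable; `disable_cuda_jit_cache`, `cuda_jit_cache_disabled` and `example_usage` are hypothetical helper names, `setenv` is POSIX rather than standard C++, and it is an assumption here that the variable has to be set before the process initializes the CUDA driver.

```cpp
#include <cassert>
#include <cstring>
#include <stdlib.h>  // setenv/getenv; setenv is POSIX, not ISO C++

// Hypothetical helper (not a CUDA API): sets CUDA_CACHE_DISABLE=1 for this
// process. Assumption: call it before the CUDA driver is initialized.
inline bool disable_cuda_jit_cache() {
    return setenv("CUDA_CACHE_DISABLE", "1", /*overwrite=*/1) == 0;
}

// Reports whether the variable is currently set to "1" in this process.
inline bool cuda_jit_cache_disabled() {
    const char* value = getenv("CUDA_CACHE_DISABLE");
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Intended call order: disable the cache first, then load the PTX module
// (for example with cuModuleLoad()) and time the load.
inline void example_usage() {
    const bool ok = disable_cuda_jit_cache();
    assert(ok);
    (void)ok;  // avoids an unused-variable warning when NDEBUG is defined
    assert(cuda_jit_cache_disabled());
}
```

If loading the same PTX takes about as long on every run with the variable set, but gets faster from the second run onward without it, that would be consistent with the cache being used.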

From my empirical observations, the answer seems to be:

| Compilation with | JIT cache used? |
| --- | --- |
| NVRTC targeting concrete architecture | No |
| NVRTC targeting "virtual" architecture | ??? |
| Loading a module with the CUDA driver | Yes |
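
Going by the reasoning in the other answer (not by measurement), the "???" row would be expected to behave like any other PTX load: a virtual target yields PTX, which the driver must JIT. The sketch below encodes only that expectation. `NvrtcOutput`, `output_kind` and `jit_cache_expected` are hypothetical names, and the prefix test assumes architecture names of the form `sm_XX` (concrete) and `compute_XX` (virtual).

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of what NVRTC is asked to emit for a target.
enum class NvrtcOutput { Sass, Ptx, Unknown };

// Assumes "sm_XX" names a concrete architecture and "compute_XX" a virtual one.
inline NvrtcOutput output_kind(const std::string& arch) {
    if (arch.rfind("sm_", 0) == 0)      return NvrtcOutput::Sass;
    if (arch.rfind("compute_", 0) == 0) return NvrtcOutput::Ptx;
    return NvrtcOutput::Unknown;
}

// Expectation derived from the first answer: only PTX goes through the
// driver's JIT step, so only PTX can involve the JIT cache.
inline bool jit_cache_expected(const std::string& arch) {
    return output_kind(arch) == NvrtcOutput::Ptx;
}

// The rows of the table that concern NVRTC, restated as checks.
inline void check_table_rows() {
    assert(!jit_cache_expected("sm_70"));      // concrete architecture: No
    assert(jit_cache_expected("compute_70"));  // virtual architecture: expected Yes
}
```

This does not replace an actual measurement for the virtual-architecture row; it only states what the first answer predicts for it.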