
Does the CUDA JIT compiler perform device link-time optimization?

Before device link-time optimization (DLTO) was introduced in CUDA 11.2, it was relatively easy to ensure forward compatibility without worrying too much about differences in performance. You would typically just create a fatbinary containing PTX for the lowest possible arch and SASS for the specific architectures you would normally target. For any future GPU architectures, the JIT compiler would then assemble the PTX into SASS optimized for that specific GPU arch.

Now, however, with DLTO, it is less clear to me how to ensure forward compatibility and maintain performance on those future architectures.

Let's say I compile/link an application using nvcc with the following options:

Compile

-gencode=arch=compute_52,code=[compute_52,lto_52]
-gencode=arch=compute_61,code=lto_61

Link

-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto
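For concreteness, the two steps above might look like the following as full build commands. This is a sketch: the source/output file names and the use of -dc (relocatable device code, which device linking requires) are my assumptions, not part of the original question.

```shell
# Compile: embed cc_52 PTX plus LTO intermediaries for sm_52 and sm_61.
# (-dc is assumed here; separate compilation is needed for device linking.)
nvcc -dc \
  -gencode=arch=compute_52,code=[compute_52,lto_52] \
  -gencode=arch=compute_61,code=lto_61 \
  main.cu -o main.o

# Link with -dlto: produce link-time-optimized SASS for sm_52 and sm_61.
nvcc -dlto \
  -gencode=arch=compute_52,code=[sm_52,sm_61] \
  main.o -o app

# Inspect the resulting fatbin sections to verify what was actually embedded.
cuobjdump -all app
```

The cuobjdump step at the end is how the fatbin contents described below were observed.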

This will create a fatbinary containing PTX for cc_52, LTO intermediaries for sm_52 and sm_61, and link-time optimized SASS for sm_52 and sm_61 (or at least this appears to be the case when dumping the resulting fatbin sections using cuobjdump -all).

Assuming the above is correct, what happens when the application is run on a later GPU architecture (e.g. sm_70)? Does the JIT compiler just assemble the cc_52 PTX without using link-time optimization (resulting in less optimal code)? Or does it somehow link the LTO intermediaries using link-time optimization? Is there a way to determine/guide what the JIT compiler is doing?

According to an NVIDIA employee on the CUDA forums, the answer is "not yet":

Good question. We are working on support for JIT LTO, but in 11.2 it is not supported. So in the example you give, at JIT time it will JIT each individual PTX to cubin and then do a cubin link. This is the same as we have always done for JIT linking. But we should have more support for JIT LTO in future releases.

