
What is the purpose of using multiple "arch" flags in Nvidia's NVCC compiler?

I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.

From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's application, and also the minimum device compute architecture that NVCC's JIT compiler will compile PTX code for.

I also understand that the "code" parameter of -gencode is the compute architecture which NVCC completely compiles the application for, such that no JIT compilation is necessary.

After inspection of various CUDA project Makefiles, I've noticed the following occur regularly:

-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21

and after some reading, I found that multiple device architectures could be compiled for in a single binary file - in this case sm_20 and sm_21.

My questions are: why are so many arch/code pairs necessary? Are all of the "arch" values above actually used?

What is the difference between that and, say:

-arch compute_20
-code sm_20
-code sm_21

Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?

Is there any other compilation and runtime behaviour I should be aware of?

I've read the manual, http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation, and I'm still not clear regarding what happens at compilation or runtime.

Roughly speaking, the code compilation flow goes like this:

CUDA C/C++ device code source --> PTX --> SASS

The virtual architecture (eg compute_20, whatever is specified by -arch compute...) determines what type of PTX code will be generated. The additional switches (eg -code sm_21) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.
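For example, a build along these lines (kernel.cu and app are placeholder names, not from the original post, and a toolkit that still supports cc2.x is assumed) uses compute_20 as the virtual architecture to generate PTX internally and then compiles that PTX down to sm_21 SASS, which is what gets stored in the resulting binary:

nvcc kernel.cu -gencode arch=compute_20,code=sm_21 -o app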

As you point out, one of the handy features of GPU operation is JIT-compile. JIT-compile will be done by the GPU driver (does not require the CUDA toolkit to be installed) anytime a suitable PTX code is available but a suitable SASS code is not. The definition of a "suitable PTX" code is one which is numerically equal to or lower than the GPU architecture being targeted for running the code. To pick an example, specifying arch=compute_30,code=compute_30 would tell nvcc to embed cc3.0 PTX code in the executable. This PTX code could be used to generate SASS code for any future architecture that the GPU driver supports. Currently this would include architectures like Pascal, Volta, Turing, etc., assuming the GPU driver supports those architectures.

One advantage of including multiple virtual architectures (ie multiple versions of PTX), then, is that you have executable compatibility with a wider variety of target GPU devices (although some devices may trigger a JIT-compile to create the necessary SASS).

One advantage of including multiple "real GPU targets" (ie multiple SASS versions) is that you can avoid the JIT-compile step, when one of those target devices is present.包含多个“真实 GPU 目标”(即多个 SASS 版本)的优势之一是,当存在这些目标设备之一时,您可以避免 JIT 编译步骤。

If you specify a bad set of options, it's possible to create an executable that won't run (correctly) on a particular GPU.

One possible disadvantage of specifying a lot of these options is code size bloat. Another possible disadvantage is compile time, which will generally be longer as you specify more options.
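One way to check what actually got embedded is the cuobjdump tool that ships with the CUDA toolkit (app being the placeholder executable name from the sketches above); as far as I recall it supports:

cuobjdump --list-elf app
cuobjdump --list-ptx app

which list the embedded SASS images and PTX versions respectively.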

It's also possible to create executables that contain no PTX, which may be of interest to those trying to obscure their IP.

Creating PTX suitable for JIT should be done by specifying a virtual architecture for the code switch.
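A minimal sketch of that (placeholder file names again): passing a virtual architecture to code as well embeds only PTX, leaving all SASS generation to the driver's JIT compiler:

nvcc kernel.cu -gencode arch=compute_30,code=compute_30 -o app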

The purpose of multiple -arch flags is to use the __CUDA_ARCH__ macro for conditional compilation (ie, using #ifdef) of differently-optimized code paths.

See here: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-identification-macro
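As an illustration (a hypothetical kernel, not from either post): __CUDA_ARCH__ is defined only while device code is being compiled, and it takes the value of the virtual architecture of the current compilation pass, so each -gencode pass can select a different code path:

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
        // Compiled when the virtual architecture is compute_35 or newer:
        // use __ldg() for a read-only cached load.
        data[i] = __ldg(&data[i]) * factor;
#else
        // Fallback path for older virtual architectures.
        data[i] = data[i] * factor;
#endif
    }
}

Compiling with, say, -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 then produces two SASS versions, one taking each branch.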
