
NVCC compilation options for generating the best code (using JIT)

I am trying to understand the nvcc compilation phases, but I am a little confused. Because I don't know the exact hardware configuration of the machine that will run my software, I want to use the JIT compilation feature to generate the best possible code for it. In the NVCC documentation I found this:

"For instance, the command below allows generation of exactly matching GPU binary code, when the application is launched on an sm_10, an sm_13, and even a later architecture:"

nvcc x.cu -arch=compute_10 -code=compute_10

So my understanding is that the above options will produce the best/fastest/optimum code for the current GPU. Is that correct? I also read that the default nvcc options are:

nvcc x.cu -arch=compute_10 -code=sm_10,compute_10

If the above is indeed correct, why can't I use any compute_20 features in my application?

When you specify a target architecture, you restrict yourself to the features available in that architecture. That's because PTX is a virtual assembly language, so the set of available features has to be known when the PTX is generated. The PTX will be JIT-compiled to GPU binary code (SASS) for whatever GPU you are running on, but it can't use features of architectures newer than the one it was generated for.

I suggest that you pick a minimum architecture (for example, 1.3 if you want double precision, or 2.0 if you want a Fermi-or-later feature) and then create PTX for that architecture and for newer base architectures. You can do this in one command (although it will take longer, since it requires multiple passes through the code) and bundle everything into a single fat binary.

An example command line might be:

nvcc <general options> <filename.cu> \
    -gencode arch=compute_13,code=compute_13 \
    -gencode arch=compute_20,code=compute_20 \
    -gencode arch=compute_30,code=compute_30 \
    -gencode arch=compute_35,code=compute_35

That will create four PTX versions in the binary. You could also compile for selected GPUs at the same time, which has the advantage of avoiding JIT compile time for your users but increases your binary size.
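To illustrate the last point, here is a minimal sketch of how such a mixed command line could be assembled: binary code (code=sm_XX) for each GPU you want to support without JIT, plus PTX (code=compute_XX) for the newest one so that later GPUs can still JIT-compile it. The helper name gencode_flags and the file name x.cu are made up for this example, and the script only prints the command rather than running nvcc. The architecture numbers follow the ones used above; recent toolkits have dropped them, so substitute whatever your nvcc supports.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvcc): turn a list of compute
# capabilities into -gencode flags.
gencode_flags() {
    flags=""
    last=""
    for cc in "$@"; do
        # SASS for a real architecture, built from the matching virtual one
        flags="$flags -gencode arch=compute_$cc,code=sm_$cc"
        last=$cc
    done
    # PTX for the last (newest) listed architecture, kept for JIT on later GPUs
    if [ -n "$last" ]; then
        flags="$flags -gencode arch=compute_$last,code=compute_$last"
    fi
    # Drop the leading space
    echo "${flags# }"
}

# Print the command instead of executing it; x.cu is a placeholder.
echo "nvcc x.cu $(gencode_flags 20 30 35)"
```

Each extra code=sm_XX entry adds one more binary image to the fat binary, which is where the size increase comes from.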

Check out the NVCC manual for more information on this.
