
How is an if statement executed in NVIDIA GPUs?

Original question: 2022-10-05 16:41:46. Tags: c++, cuda, gpu, nvidia, cpu-architecture

As far as I know, GPU cores are very simple and can only execute basic mathematical instructions. If I have a kernel with an if statement, what executes that if statement? The FP32, FP64 and INT32 units can only execute operations on floats, doubles and integers, not a COMPARE instruction, or am I wrong? What happens if I have a printf call in a kernel? What executes that?

1 answer

Compare instructions are arithmetic instructions: a comparison can be implemented as a subtraction plus a flag register, and GPGPUs have them.
They are just not advertised as much as the number-crunching capability of the GPU as a whole.
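To make the "subtraction plus a flag register" idea concrete, here is a minimal C++ sketch. The Flags struct and the helper names are made up for illustration and are loosely modelled on the classic zero/negative/carry/overflow condition codes; this is not a description of any real GPU's flag register.

```cpp
#include <cstdint>

// Hypothetical flag register (illustrative only, not a real GPU's layout).
struct Flags {
    bool zero;      // the difference is 0
    bool negative;  // sign bit of the difference
    bool carry;     // no borrow occurred, i.e. unsigned a >= b
    bool overflow;  // the signed subtraction overflowed
};

// "Compare" = subtract, discard the difference, keep only the flags.
inline Flags compare(std::int32_t a, std::int32_t b) {
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    const std::uint32_t diff = ua - ub;  // unsigned arithmetic wraps modulo 2^32

    Flags f;
    f.zero = (diff == 0);
    f.negative = (diff >> 31) != 0;
    f.carry = (ua >= ub);
    // Signed overflow: the operands differ in sign and the result's sign differs from a's.
    f.overflow = (((ua ^ ub) & (ua ^ diff)) >> 31) != 0;
    return f;
}

// Conditions are then simple functions of the flags.
inline bool is_equal(Flags f)         { return f.zero; }
inline bool is_less_signed(Flags f)   { return f.negative != f.overflow; }
inline bool is_less_unsigned(Flags f) { return !f.carry; }
```

For example, is_less_signed(compare(1, 2)) is true, while is_less_unsigned(compare(-1, 1)) is false because -1 reinterpreted as unsigned is the largest 32-bit value.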

NVIDIA doesn't publish full machine-code documentation for its GPUs, nor the ISA of the corresponding assembly language (called SASS).
Instead, NVIDIA maintains the PTX language, which is designed to be more portable across GPU generations while still being very close to the actual machine code.

PTX is a predicated architecture. The setp instruction (which, again, is just a subtraction with a few caveats) sets the value of the specified predicate registers, and these are used to conditionally execute other instructions. That includes bra, the branch instruction, which makes conditional branches possible.
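As a rough illustration of predication, the following C++ sketch models a single thread with a predicate register and guarded instructions. The Thread struct, the register names and the helper functions are hypothetical, and the PTX-style lines in the comments only indicate the general shape of such code; they are not actual compiler output. Note that the C++ "if (guard)" inside each helper stands in for hardware that simply suppresses the write, so the modelled instruction stream itself contains no branch.

```cpp
#include <cstdint>

// Toy model of one thread on a predicated machine (all names are illustrative).
struct Thread {
    std::int32_t r1, r2, r3;  // general-purpose registers
    bool p1;                  // predicate register
};

// Models: setp.lt.s32 p1, r1, r2   (p1 = r1 < r2)
inline void setp_lt(Thread& t) { t.p1 = (t.r1 < t.r2); }

// Models: @guard add.s32 r3, r1, imm
inline void add_r3_r1(Thread& t, bool guard, std::int32_t imm) {
    if (guard) t.r3 = t.r1 + imm;  // guard false: the instruction has no effect
}

// Models: @guard mul.lo.s32 r3, r2, imm
inline void mul_r3_r2(Thread& t, bool guard, std::int32_t imm) {
    if (guard) t.r3 = t.r2 * imm;
}

// Source-level code being modelled:
//     if (x < y) r = x + 1; else r = y * 2;
// Both sides are issued; the predicate decides which one takes effect.
inline std::int32_t predicated_if_else(std::int32_t x, std::int32_t y) {
    Thread t{x, y, 0, false};
    setp_lt(t);
    add_r3_r1(t, t.p1, 1);    // @p1
    mul_r3_r2(t, !t.p1, 2);   // @!p1
    return t.r3;
}
```

Calling predicated_if_else(3, 10) takes the "then" side and returns 4; predicated_if_else(10, 3) takes the "else" side and returns 6. A conditional branch works the same way: the guard is simply attached to the branch instruction instead of to an arithmetic one.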

One could argue that PTX is not SASS, but predication appears to be what NVIDIA GPUs actually do in hardware, or at least used to do.

AMD GPUs seem to use the traditional approach to branching: there are comparison instructions (e.g. S_CMP_EQ_U64) and conditional branches (e.g. S_CBRANCH_SCCZ).

Intel GPUs also rely on predication but have different instructions for divergent vs. non-divergent branches.

So GPGPUs do have branch instructions; in fact, their SIMT model has to deal with the branch divergence problem.
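Branch divergence can be pictured with a small host-side model. The sketch below is a simplification under stated assumptions: a warp is 32 lanes running in lockstep, each lane evaluates the condition, and the two sides of the branch are issued one after the other with the non-participating lanes masked off. The function name and the Collatz-style body are arbitrary choices for illustration, and real hardware scheduling is more involved.

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32;  // NVIDIA warps are 32 threads wide
using Warp = std::array<std::int32_t, kWarpSize>;

// Models one lockstep execution of
//     if (v % 2 == 0) v /= 2; else v = 3 * v + 1;
// across a whole warp. Returns the number of paths that had to be issued:
// 1 if all lanes agree (uniform branch), 2 if the warp diverges.
inline int collatz_step_simt(Warp& v) {
    // Every lane evaluates the condition; the results form an active mask.
    std::uint32_t taken = 0;
    for (int lane = 0; lane < kWarpSize; ++lane)
        if (v[lane] % 2 == 0) taken |= (1u << lane);

    int passes = 0;

    if (taken != 0) {  // at least one lane wants the "then" path
        ++passes;
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (taken & (1u << lane)) v[lane] /= 2;  // other lanes sit idle
    }

    const std::uint32_t not_taken = ~taken;
    if (not_taken != 0) {  // at least one lane wants the "else" path
        ++passes;
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (not_taken & (1u << lane)) v[lane] = 3 * v[lane] + 1;
    }

    return passes;
}
```

If every lane holds an even number, only one path is issued and the branch costs nothing extra. If the lanes hold a mix of even and odd numbers, both paths are issued back to back, which is the divergence penalty.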

Before c. 2006, GPUs were not fully programmable and programmers had to rely on other tricks (like data masking or branchless code) to implement their kernels.
Keep in mind that at the time it was not widely accepted that one could execute arbitrary programs or create arbitrary shading effects on GPUs. GPUs relaxed their programming constraints over time.
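For readers unfamiliar with data masking, here is a small C++ sketch of the general idea: compute both candidate values and pick one with bitwise operations instead of a branch. The function names are invented for this example, and it is not meant to reproduce any particular pre-2006 shader technique.

```cpp
#include <cstdint>

// Returns a if cond is true, b otherwise, without branching on cond.
inline std::uint32_t select_mask(bool cond, std::uint32_t a, std::uint32_t b) {
    // 0 - 1 wraps to 0xFFFFFFFF, 0 - 0 stays 0: an all-ones or all-zeros mask.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    return (a & mask) | (b & ~mask);
}

// Branchless maximum built on top of the masked select.
inline std::uint32_t max_branchless(std::uint32_t a, std::uint32_t b) {
    return select_mask(a > b, a, b);
}
```

The comparison a > b still happens, but its result is consumed as data (a mask) rather than as a control-flow decision.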

Putting a printf in a CUDA kernel probably won't work, because there is no C runtime on the GPU (remember that the GPU is an entirely different executor from the CPU), so I guess linking would fail.
You could theoretically force a GPU implementation of the CRT and design a mechanism to make syscalls from GPU code, but that would be unimaginably slow since GPUs are not designed for this kind of work.
EDIT: Apparently NVIDIA actually did implement a printf on the GPU that prints to a buffer shared with the host.
The problem here is not the presence of branches but the very nature of printf.
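The buffer-based approach mentioned in the edit can be modelled on the host with a few lines of C++. This is only a sketch of the general idea (the device appends raw records, the host formats and prints them later); the class, its methods and the record layout are hypothetical, and the answer does not describe how NVIDIA's implementation works internally. One known simplification: this model rejects new records when the buffer is full, which is just one possible overflow policy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: the "device" stores raw data only, no formatting.
struct PrintRecord {
    int thread_id;
    int value;
};

class SharedPrintBuffer {
public:
    explicit SharedPrintBuffer(std::size_t capacity) : capacity_(capacity) {}

    // "Device side": a cheap append with no I/O and no C runtime involved.
    // Returns false if the buffer is full and the record was dropped.
    bool device_print(int thread_id, int value) {
        if (records_.size() >= capacity_) return false;
        records_.push_back(PrintRecord{thread_id, value});
        return true;
    }

    // "Host side": called at a synchronisation point; does the actual
    // formatting and empties the buffer.
    std::string host_flush() {
        std::string out;
        for (const PrintRecord& r : records_) {
            out += "thread " + std::to_string(r.thread_id) + ": " +
                   std::to_string(r.value) + "\n";
        }
        records_.clear();
        return out;
    }

private:
    std::size_t capacity_;
    std::vector<PrintRecord> records_;
};
```

The point of the split is that the expensive, OS-dependent part of printf (formatting and writing to a stream) stays on the CPU, while the GPU side only does a memory write.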