In NVIDIA GPUs, can ld/st and arithmetic instructions (such as INT32, FP32) run simultaneously in the same SM?
Specifically for the Turing and Ampere architectures: within the same SM and the same warp scheduler, can warps run ld/st and arithmetic instructions simultaneously?
I want to know how the warp scheduler works.
In the same SM and same warp scheduler, can the warps run ld/st and arithmetic instructions simultaneously?
No, not if "simultaneously" means "issued in the same clock cycle".
In current CUDA GPUs, including Turing and Ampere, when the warp scheduler issues an instruction, it issues the same instruction to all threads in the warp in any given clock cycle.
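As a minimal sketch of what that means in practice (the kernel below is my own illustration, not code from the question): every active thread in the warp executes the same instruction stream, so the scheduler issues, for example, one FADD per warp per cycle rather than one per thread.

    // Hypothetical illustration: all 32 threads of a warp execute the same
    // instruction, so the addition below compiles to a single FADD that the
    // warp scheduler issues warp-wide in one clock cycle.
    __global__ void fadd_kernel(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];  // one FADD, issued for the whole warp
    }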
Different instructions could be run in different clock cycles (of course), and different instructions can be run in the same clock cycle if those instructions are issued by different warp schedulers in the SM. This would also imply that those instructions are issued to distinct/separate SM units.
So, for example, an integer add instruction issued by warp scheduler 0 would have to go to separate functional units compared to a load/store instruction issued by warp scheduler 1 in the same SM. In this example, since the instructions are different, different functional units are needed anyway, so this is self-evident.
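A hedged sketch of that scenario (kernel name and shape are my own): a kernel mixing global loads/stores with integer adds. With enough warps resident, a warp on scheduler 0 can be issuing its IADD in the same cycle that a warp on scheduler 1 issues its LDG, because those instructions go to separate functional units (INT32 units vs. load/store units).

    // Hypothetical kernel mixing LDG/STG (load/store units) with IADD (INT32 units).
    __global__ void mixed_kernel(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int v = in[i];   // LDG  -> load/store unit
            v = v + 1;       // IADD -> INT32 unit
            out[i] = v;      // STG  -> load/store unit
        }
    }

A profiler such as Nsight Compute reports scheduler and issue statistics per SM sub-partition (smsp), which reflects this per-scheduler organization.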
But even if both warp schedulers were issuing, for example, FADD (for 2 different warps), they would have to issue to separate floating-point functional units in the SM.
In modern CUDA GPUs, due to the partitioning of the SM, each warp scheduler has its own execution resources (functional units) for at least some instruction types, like FADD. So this would happen anyway in this example, for that reason as well.
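To put numbers on that partitioning, here is a small host-side sketch (my own, with the 4-schedulers-per-SM figure for Turing/Ampere hard-coded as an assumption, since the runtime API does not expose it):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        const int schedulersPerSM = 4;  // assumption: Turing/Ampere SMs have 4 partitions
        int warpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("%s: %d SMs, %d resident warps/SM, ~%d warps per scheduler partition\n",
               prop.name, prop.multiProcessorCount, warpsPerSM,
               warpsPerSM / schedulersPerSM);
        return 0;
    }

On a Turing GPU this should report 32 resident warps per SM (8 per partition); on GA102-class Ampere, 48 per SM (12 per partition).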