
In NVIDIA GPUs, can ld/st and arithmetic instructions (such as INT32, FP32) run simultaneously in the same SM?

Especially on the Turing and Ampere architectures: within the same SM and the same warp scheduler, can warps run ld/st and other arithmetic instructions simultaneously?

I want to know how the warp scheduler works.

In the same SM and same warp scheduler, can the warps run ld/st and other arithmetic instructions simultaneously?

No, not if "simultaneously" means "issued in the same clock cycle".

In current CUDA GPUs, including Turing and Ampere, when the warp scheduler issues an instruction in any given clock cycle, it issues the same instruction to all threads in the warp.
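As a toy illustration (a hypothetical model, not NVIDIA's actual hardware), this SIMT issue behavior can be sketched as a scheduler that picks one warp per cycle and broadcasts that warp's next instruction to every lane:

```python
# Hypothetical sketch of SIMT issue: one warp scheduler selects one warp
# per clock cycle and issues the SAME instruction to all 32 lanes of it.
WARP_SIZE = 32

def issue_cycle(warp_streams, cycle):
    """Each warp has its own instruction stream. Per cycle, one warp is
    selected (simple round-robin here) and its next instruction is
    broadcast to all lanes of that warp."""
    warp_id = cycle % len(warp_streams)
    stream = warp_streams[warp_id]
    if not stream:
        return None  # selected warp has nothing ready to issue
    instr = stream.pop(0)
    # All 32 lanes receive the identical instruction in this cycle.
    return [(warp_id, lane, instr) for lane in range(WARP_SIZE)]

warps = {0: ["FADD", "LDG"], 1: ["IADD"]}
issued = issue_cycle(warps, cycle=0)
assert len(issued) == WARP_SIZE
assert all(instr == "FADD" for _, _, instr in issued)
```

Note that in this model a single scheduler can never emit an FADD and an LDG in the same cycle; they occupy different cycles of the same issue slot.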

Different instructions can of course be run in different clock cycles, and different instructions can be run in the same clock cycle if those instructions are issued by different warp schedulers in the SM. This also implies that those instructions are issued to distinct/separate SM units.

So, for example, an integer add instruction issued by warp scheduler 0 would have to be issued to separate functional units compared to a load/store instruction issued by warp scheduler 1 in the same SM. In this example, since the instructions are different, different functional units are needed anyway, so this is self-evident.

But even if both warp schedulers were issuing, for example, FADD (for two different warps), they would have to issue to separate floating-point functional units in the SM.

In modern CUDA GPUs, due to the partitioning of the SM, each warp scheduler has its own execution resources (functional units) for at least some instruction types, like FADD. So, for that reason, this would happen anyway in this example.
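The partitioned-SM picture above can be sketched the same way (again a hypothetical model; the partition and unit names are made up for illustration): each scheduler partition owns private functional units, so two partitions can issue different instruction types, or even the same type, in the same cycle without sharing a unit:

```python
# Hypothetical sketch of a partitioned SM: each warp-scheduler partition
# owns its own functional units, so two partitions can each issue one
# instruction in the same clock cycle without contending for a unit.
UNIT_FOR = {"FADD": "fp32_pipe", "IADD": "int32_pipe", "LDG": "ldst_unit"}

class Partition:
    def __init__(self, pid):
        self.pid = pid
        # Private copies of the functional units for this partition.
        self.units = {u: f"partition{pid}.{u}" for u in UNIT_FOR.values()}

    def issue(self, instr):
        # One instruction per cycle, routed to this partition's own unit.
        return self.units[UNIT_FOR[instr]]

partitions = [Partition(p) for p in range(4)]  # e.g. 4 schedulers per SM

# Same cycle: scheduler 0 issues an FADD, scheduler 1 issues an LDG.
same_cycle = [partitions[0].issue("FADD"), partitions[1].issue("LDG")]
assert same_cycle == ["partition0.fp32_pipe", "partition1.ldst_unit"]

# Even two FADDs in the same cycle land on distinct per-partition units.
assert partitions[0].issue("FADD") != partitions[1].issue("FADD")
```

The key point the model captures: concurrency across instruction types in one cycle comes from having multiple scheduler partitions, not from one scheduler dual-routing within a cycle.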

