Can CUDA cores run things absolutely parallel or do they need context switching?
Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching? I know that this is not possible on a CPU, but is it on an NVIDIA GPU?

I know that an SM can run warps, and that if a core has to wait for some information, it gets another thread from the dispatch unit.
I know that it is not possible on a CPU, but is it on an NVIDIA GPU?
This assertion is wrong on modern mainstream CPUs (e.g. for at least a decade on nearly all x86-64 processors, starting with Intel Skylake or AMD Zen 2). Indeed, modern x86-64 Intel/AMD processors can generally compute 2 (256-bit AVX) SIMD vectors in parallel, since there are generally 2 SIMD units. Processors like Intel Skylake also have 4 ALU units capable of computing 4 basic arithmetic operations (e.g. add, sub, and, xor) in parallel per cycle. Some instructions, like division, are far more expensive and do not run in parallel on such architectures, though they are well pipelined. The instructions can come from the same thread on the same logical core, or possibly from 2 threads (of possibly 2 different processes) scheduled on 2 logical cores, without any context switches. Note that recent high-end ARM processors can also do this (even some mobile processors).
Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching?
NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Thus, 1 instruction operates on 32 items in parallel (though, theoretically, the hardware is free not to do that completely in parallel). A kernel execution basically consists of many blocks, and blocks are scheduled onto SMs. An SM can operate on many blocks concurrently, so there is a massive amount of parallelism available.
Whether a specific GPU can execute two INT32 warps in parallel depends on the target architecture, not on CUDA itself. On modern Nvidia GPUs, each SM is split into multiple partitions that can each execute instructions on warps independently of the other partitions. For example, AFAIK, on a Pascal GP104 there are 20 SMs, and each SM has 4 partitions, each capable of running SIMD instructions operating on 1 warp (32 items) at a time. In practice, things can be a bit more complex on newer architectures. You can get more information here.