Streaming multiprocessors, Blocks and Threads (CUDA)

What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?

What gets mapped to what, what is parallelized, and how? And which is more efficient: maximizing the number of blocks or the number of threads?


My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core can execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.

Is this correct?

The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:

The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

Each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use __syncthreads().
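As a minimal sketch of that distinction (the kernel name, buffers and the block size of 256 are illustrative assumptions, not something from the answer above): threads of one block can cooperate through shared memory, but because a 256-thread block spans several warps, __syncthreads() is needed between the cooperative steps:

```
// A block-wide sum into shared memory. The 256 threads of the block span
// 8 warps, so __syncthreads() is required: only threads inside one warp run
// in lock-step, not the whole block.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];                 // one slot per thread (block size 256 assumed)
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // every warp of the block must finish its store

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                       // synchronize across warps before the next step
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];              // one partial sum per block
}
```

A launch such as blockSum<<<numBlocks, 256>>>(d_in, d_out) would then produce one partial sum per 256-element tile of the input.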

For the GTX 970 there are 13 Streaming Multiprocessors (SM) with 128 Cuda Cores each. Cuda Cores are also called Stream Processors (SP).

You can define grids which map blocks to the GPU.

You can define blocks which map threads to Stream Processors (the 128 Cuda Cores per SM).

One warp is always formed by 32 threads and all threads of a warp are executed simultaneously.
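A short sketch of how that hierarchy maps to indices (the kernel name, array names and launch sizes below are hypothetical):

```
// Each thread derives its global index from its block and thread indices.
// Consecutive threads of a block are grouped into warps of 32.
__global__ void whereAmI(int *globalIdx, int *warpInBlock)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    globalIdx[tid]   = tid;
    warpInBlock[tid] = threadIdx.x / 32;               // warp number within this block
}

// Host-side launch: a grid of 4 blocks, each with 128 threads (4 warps):
// whereAmI<<<4, 128>>>(d_globalIdx, d_warpInBlock);
```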

To use the full possible power of a GPU you need many more threads per SM than the SM has SPs. For each Compute Capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for an SM to have the resources (number of SPs free), then they are loaded. The SM starts to execute warps. Since one warp only has 32 threads and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The thing is, if the threads do memory accesses, a thread will block until its memory request is satisfied. In numbers: an arithmetic calculation on the SP has a latency of 18-22 cycles, while a non-cached global memory access can take up to 300-400 cycles. This means that if the threads of one warp are waiting for data, only a subset of the 128 SPs would work. Therefore the scheduler switches to execute another warp if one is available. And if this warp blocks, it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (how many warps the SM can choose from to execute). If the occupancy is high it is less likely that there is no work for the SPs.

Your statement that each CUDA core will execute one block at a time is wrong. If you talk about Streaming Multiprocessors, they can execute warps from all threads which reside in the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 resident blocks from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.
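The 2048-threads-per-SM figure used in this example can be reproduced at run time; a hedged sketch using the CUDA runtime's device properties (the block size of 256 is just the example value, and the computed bound ignores register, shared-memory and per-SM block limits that may lower it):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;   // the block size used in the example above
    // Upper bound on resident blocks per SM imposed by the thread limit alone.
    int residentBlocks = prop.maxThreadsPerMultiProcessor / blockSize;

    printf("%d SMs, max %d threads per SM -> at most %d resident blocks of %d threads per SM\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           residentBlocks, blockSize);
    return 0;
}
```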

You can find numbers for the different Compute Capabilities and GPU architectures here: https://en.wikipedia.org/wiki/CUDA#Limitations

You can download an occupancy calculation sheet from Nvidia: Occupancy Calculation sheet (by Nvidia).
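As a programmatic alternative to the spreadsheet, the CUDA runtime also exposes an occupancy calculator; a sketch, where myKernel and the block size of 256 are placeholder assumptions:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; occupancy depends on its actual register/shared-memory use.
__global__ void myKernel(float *data) { (void)data; }

int main()
{
    const int blockSize = 256;      // candidate block size
    int maxActiveBlocks = 0;
    // Theoretical number of resident blocks of this kernel per SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, myKernel,
                                                  blockSize, 0 /* dynamic shared mem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(maxActiveBlocks * blockSize)
                    / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy at block size %d: %.0f%%\n",
           blockSize, occupancy * 100.0f);
    return 0;
}
```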

The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread-block-level resources such as shared memory are allocated. The allocation creates sufficient warps for all threads in the thread block. The resource manager allocates warps round robin to the SM sub-partitions. Each SM sub-partition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a sub-partition it will remain on the sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore, the warp will be restored to the same SM with the same warp-id.

When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete, and then the resource manager releases the warp-level resources, which include the warp-id and register file.

When all warps in a thread block have completed, the block-level resources are released and the SM notifies the Compute Work Distributor that the block has completed.

Once a warp is allocated to a sub-partition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest-priority eligible warp and issues 1-2 consecutive instructions from the warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction. The warp will then report stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching between warps every cycle.

This answer does not use the term CUDA core as this introduces an incorrect mental model. CUDA cores are pipelined single-precision floating point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM sub-partition and SM has other execution units including load/store units, double-precision floating point units, half-precision floating point units, branch units, etc.

In order to maximize performance the developer has to understand the trade-off of blocks vs. warps vs. registers/thread.

The term occupancy is the ratio of active warps to maximum warps on an SM. Kepler through Pascal architectures (except GP100) have 4 warp schedulers per SM. The minimal number of warps per SM should at least be equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal) then you would need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction-level parallelism then this could be reduced. Almost all kernels issue variable-latency instructions such as memory loads that can take 80-1000 cycles. This requires more active warps per warp scheduler to hide latency. For each kernel there is a trade-off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised as some other sacrifice will likely be made. The CUDA profiler can help identify instruction issue rate, occupancy, and stall reasons in order to help the developer determine that balance.
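One way to explore that trade-off in code is the runtime's occupancy-based launch configurator; a sketch under the assumption of a placeholder kernel someKernel and device buffer d_data:

```
#include <cuda_runtime.h>

// Placeholder kernel whose register/shared-memory footprint drives the suggestion.
__global__ void someKernel(float *data) { (void)data; }

void launchWithSuggestedBlockSize(float *d_data, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Block size that maximizes theoretical occupancy for this particular kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, someKernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks to cover n elements
    someKernel<<<gridSize, blockSize>>>(d_data);
    // As noted above, 100% occupancy is not always the best target; profiler
    // stall reasons should guide the final choice.
}
```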

The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers, then barrier stalls can become a common stall reason. This can be alleviated by reducing the warps per thread block.
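An illustrative sketch of that mitigation (kernel name, sizes and buffer are assumptions): the same total number of threads can be launched as more, smaller blocks, so each barrier only has to gather a few warps:

```
// Placeholder for a kernel that uses __syncthreads() barriers internally.
__global__ void stencil(float *data) { (void)data; }

void launchTwoShapes(float *d_data)
{
    // 1024 threads per block: 32 warps must all reach every barrier together.
    stencil<<< 64, 1024>>>(d_data);
    // 128 threads per block: only 4 warps per barrier, at the cost of more blocks.
    stencil<<<512,  128>>>(d_data);
}
```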

There are multiple streaming multiprocessors on one device.
An SM may contain multiple blocks. Each block may contain several threads.
An SM has multiple CUDA cores (as a developer, you should not care about this because it is abstracted by the warp), which work on threads. An SM always works on a warp of threads (always 32). A warp will only work on threads from the same block.
Both the SM and the block have limits on the number of threads, the number of registers and shared memory.
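A short sketch of how those per-block and per-SM limits can be queried at run time (the fields are from the CUDA runtime's cudaDeviceProp; the printed values depend on the GPU):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("SMs:                   %d\n",  p.multiProcessorCount);
    printf("warp size:             %d\n",  p.warpSize);
    printf("max threads / block:   %d\n",  p.maxThreadsPerBlock);
    printf("max threads / SM:      %d\n",  p.maxThreadsPerMultiProcessor);
    printf("registers / block:     %d\n",  p.regsPerBlock);
    printf("registers / SM:        %d\n",  p.regsPerMultiprocessor);
    printf("shared memory / block: %zu bytes\n", p.sharedMemPerBlock);
    printf("shared memory / SM:    %zu bytes\n", p.sharedMemPerMultiprocessor);
    return 0;
}
```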
