
CUDA performance improves when running more threads than there are cores

Why does performance improve when I run more than 32 threads per block?

My graphics card has 480 CUDA cores (15 SMs * 32 SPs).

Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, waiting to fetch an instruction, or if the execution unit for its next instruction is busy. On each cycle, each warp scheduler picks a warp from the list of eligible warps and issues 1 or 2 instructions.

The more active warps per SM, the more warps each warp scheduler has to pick from on each cycle. In most cases, optimal performance is achieved when there are enough active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may even decrease it.

A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of the warps supported by a launch configuration to the maximum warps is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX 480 (a CC 2.0 device), a good starting point when designing a kernel is 50-66% Theoretical Occupancy. A CC 2.0 SM can have a maximum of 48 warps, so 50% occupancy means 24 warps, or 768 threads, per SM.

The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.

The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.

NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and possibly to compare differences between architectures. Do not use the count when designing algorithms.

Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.

The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide the latency to main memory. This results in a large percentage of chip resources being devoted to cache: most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack in more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to tons of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they rely instead on having more threads ready to run while other threads are waiting on memory accesses, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the better the chance that one of them will be runnable at any given time.

CUDA also has zero-cost thread switching to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead when switching from one thread to the next, because it must store all of the register values of the thread it is switching away from onto the stack and then load all of those of the thread it is switching to. CUDA SMs instead have tons and tons of registers, so each thread has its own set of physical registers assigned to it for the life of the thread. Since there is no need to store and load register values, an SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.

