
In what types of loops is it best to use the #pragma unroll directive in CUDA?

Original 2012-11-04 19:43:10 4 2 optimization/ cuda/ loop-unrolling

In CUDA it is possible to unroll loops using the #pragma unroll directive to improve performance by increasing instruction-level parallelism. The #pragma can optionally be followed by a number that specifies how many times the loop must be unrolled.
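For reference, here is a minimal sketch of the three forms the directive can take. The function names are hypothetical and exist only for illustration. The functions are marked `__host__ __device__` so the same loops can be checked on the CPU.

```cuda
// Illustrative helpers only; none of these names come from the CUDA toolkit.

// No number: asks for complete unrolling, which is possible here because
// the trip count (8) is a compile-time constant.
__host__ __device__ inline float sum_fixed8(const float* v) {
    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; ++i) acc += v[i];
    return acc;
}

// With a number: asks for an unroll factor of 4. The trip count n is only
// known at run time, so the compiler must also handle leftover iterations.
__host__ __device__ inline float sum_by4(const float* v, int n) {
    float acc = 0.0f;
    #pragma unroll 4
    for (int i = 0; i < n; ++i) acc += v[i];
    return acc;
}

// Factor 1: asks the compiler to leave the loop rolled.
__host__ __device__ inline float sum_rolled(const float* v, int n) {
    float acc = 0.0f;
    #pragma unroll 1
    for (int i = 0; i < n; ++i) acc += v[i];
    return acc;
}
```

All three functions compute the same kind of sum. The pragma changes only the generated code, not the result. It must appear immediately before the loop it applies to.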

Unfortunately, the docs do not give specific directions on when this directive should be used. Since small loops with a known trip count are already unrolled by the compiler, should #pragma unroll be used on larger loops? On small loops with a variable counter? And what about the optional number of unrolls? Also, is there any recommended documentation about CUDA-specific loop unrolling?

2 Answers

There aren't any hard and fast rules. The CUDA compiler has at least two unrollers: one inside the NVVM or Open64 frontend, and one in the PTXAS backend. In general, they tend to unroll loops pretty aggressively, so I find myself using #pragma unroll 1 (to prevent unrolling) more often than any other unrolling attribute. The reasons for turning off loop unrolling are twofold:

(1) When a loop is unrolled completely, register pressure can increase. For example, indexes into small local memory arrays may become compile-time constants, allowing the compiler to place the local data into registers. Complete unrolling may also tend to lengthen basic blocks, allowing more aggressive scheduling of texture and global loads, which may require additional temporary variables and thus registers. Increased register pressure can lead to lower performance due to register spilling.
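To make the local-array case concrete, here is a small sketch built around a hypothetical 4-tap filter. `TAPS`, `fir4`, and `fir4_kernel` are made-up names. Once both loops are completely unrolled, every access to `window` uses a constant index, so the compiler is free to keep the array in registers rather than in local memory. Whether it actually does so depends on the compiler version and target architecture.

```cuda
// Hypothetical example; names and sizes are assumptions for illustration.
constexpr int TAPS = 4;

__host__ __device__ inline float fir4(const float* in, const float* coeff, int i) {
    float window[TAPS];  // small per-thread array

    // Complete unroll: window[k] becomes window[0], window[1], ...
    #pragma unroll
    for (int k = 0; k < TAPS; ++k) window[k] = in[i + k];

    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < TAPS; ++k) acc += window[k] * coeff[k];
    return acc;
}

// One output element per thread; threads whose window would run past the
// end of the input do nothing.
__global__ void fir4_kernel(const float* in, const float* coeff, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + TAPS <= n) out[i] = fir4(in, coeff, i);
}
```

Each unrolled element occupies a register, which is how register pressure grows. Compiling with `nvcc -Xptxas -v` prints per-kernel register counts and any spill loads and stores, so the effect of adding or removing the pragma can be compared directly.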

(2) Partially unrolled loops usually require a certain amount of pre-computation and clean-up code to handle loop counts that are not an exact multiple of the unrolling factor. For loops with short trip counts, this overhead can swamp any performance gains to be had from the unrolled loop, leading to lower performance after unrolling. While the compiler contains heuristics for finding suitable loops under these restrictions, the heuristics can't always provide the best decision.
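The shape of that overhead is easier to see when the partial unrolling is written out by hand. The sketch below is a manual factor-4 unroll of a simple sum. It is not what the compiler literally emits, and `sum_unrolled4` is a hypothetical name. It assumes `n >= 0`.

```cuda
// Hand-written illustration of partial unrolling by 4; not compiler output.
__host__ __device__ inline float sum_unrolled4(const float* v, int n) {
    float acc = 0.0f;
    int i = 0;

    // Pre-computation: largest multiple of 4 that does not exceed n.
    int main_end = n - (n % 4);

    // Main body: four iterations per trip. "#pragma unroll 1" asks the
    // compiler not to unroll this already-unrolled loop any further.
    #pragma unroll 1
    for (; i < main_end; i += 4) {
        acc += v[i];
        acc += v[i + 1];
        acc += v[i + 2];
        acc += v[i + 3];
    }

    // Clean-up: the remaining 0 to 3 iterations.
    #pragma unroll 1
    for (; i < n; ++i) acc += v[i];

    return acc;
}
```

With `n = 3`, the main body never executes. Only the pre-computation and the clean-up loop run, so the unrolled version does strictly more work than the plain loop. This is the short-trip-count case described above.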

In rare cases I have found that manually providing a higher unrolling factor than what the compiler used automatically has a small beneficial effect on performance (with a typical gain in the single-digit percent range). These are typically cases of memory-intensive code where a larger unrolling factor allows more aggressive scheduling of global or texture loads, or very tight computationally bound loops that benefit from minimization of the loop overhead.

Playing with unrolling factors is something that should happen late in the optimization process, as the compiler defaults cover most cases one will encounter in practice.

It's a tool that you can use to unroll loops. The specifics of when it should or shouldn't be used will vary a lot depending on your code (what's inside the loop, for instance). There aren't really any good generic tips, except to think about what your code would look like unrolled versus rolled and whether it would be better unrolled.