
When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize , which is set to the warp size of the device executing the kernel (ie 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
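For instance, a minimal sketch of what triggers that complaint (the kernel name is illustrative):

__global__ void my_kernel()
{
    // rejected by nvcc: warpSize is not a constant expression
    __shared__ int buf[warpSize];
}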

So, at least for that purpose you are motivated to have something like ( edit ):

enum : unsigned int { warp_size  = 32 };

somewhere in your headers. But now - which should I prefer, and when: warpSize , or warp_size ?

Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.

Let's get a couple of points straight. The warp size isn't a compile time constant and shouldn't be treated as one. It is an architecture specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX, however that changed at least 6 years ago if my memory doesn't fail me.

The value is available:

  1. In CUDA C via warpSize , where it is not a compile time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
  2. In PTX assembler via WARP_SZ , where it is a runtime immediate constant
  3. From the runtime API as a device property (a minimal sketch follows this list)
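For example, reading it through the runtime API on the host:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  // query device 0
int warp_sz = prop.warpSize;        // 32 on all architectures to date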

Don't declare your own constant for the warp size, that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime to get it. If you have a statically declared in-kernel array you need to dimension from the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
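A minimal sketch of that template approach, where the kernel name is illustrative and 32/64 are the only sizes instantiated:

template <int WarpSize>
__global__ void my_kernel()
{
    __shared__ float buf[WarpSize]; // statically dimensioned per instantiation
    // ...
}

void launch(dim3 grid, dim3 block)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // read the warp size at runtime
    switch (prop.warpSize) {
        case 32: my_kernel<32><<<grid, block>>>(); break;
        case 64: my_kernel<64><<<grid, block>>>(); break;
    }
}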

Contrary to talonmies's answer I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (eg __umul24 ).

The day we get a different warp size (eg 64) many things will change:

  • The warpSize will have to be adjusted obviously
  • Many warp-level intrinsics will need their signatures adjusted, or a new version produced, eg int __ballot , and while int does not need to be 32-bit, it is most commonly so!
  • Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:

     for (int i = 0; i < log2(warpSize); ++i) ... 

     that would be overly complex in something that is usually a time-critical piece of code.

  • warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:

     warpIdx = threadIdx.x/32; laneIdx = threadIdx.x%32; 

     which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize this suddenly becomes a quite expensive operation (see the sketch after this list)!
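A sketch of the difference, assuming the warp_size constant from the question:

// with a power-of-two compile-time constant the compiler emits shift/mask:
unsigned warpIdx = threadIdx.x / warp_size;  // reduces to threadIdx.x >> 5
unsigned laneIdx = threadIdx.x % warp_size;  // reduces to threadIdx.x & 31
// with the non-constant warpSize, actual divide/modulo sequences are emitted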

At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant. Also, if the amount of shared memory depends on the warpSize , this forces you to use dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
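To illustrate that inconvenience, a sketch of carving two arrays out of one dynamic allocation (names and sizes are illustrative):

__global__ void my_kernel()
{
    extern __shared__ char smem[];                      // single dynamic allocation
    float* a = reinterpret_cast<float*>(smem);          // first array
    int*   b = reinterpret_cast<int*>(a + blockDim.x);  // second array, offset by hand
    // ...
}

// at launch, the total byte count must be summed manually:
// my_kernel<<<grid, block, block.x * (sizeof(float) + sizeof(int))>>>();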

Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:

deviceFunction<warp_size>(params)

This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.


My suggestion would be to have a single header that controls all the model-specific constants, eg

#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32; 
#endif

Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.
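For instance, a kernel can then dimension static shared memory directly (a sketch assuming the header above is included; the kernel name is illustrative):

__global__ void my_reduction(const float* in, float* out)
{
    __shared__ float partials[warp_size]; // compile-time constant, no dynamic shmem needed
    // ...
}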
