
Why launch a multiple of 32 number of threads in CUDA?

Original post: 2014-10-28 14:44:27 | Tags: parallel-processing, cuda

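I took a course on CUDA parallel programming, and I have seen many examples of CUDA thread configuration where the number of threads needed is rounded up to the closest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads the GPU will round it up to 1024 anyway, so why do it explicitly?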

1 Answer

The advice is generally given in the context of situations where you might conceivably choose various threadblock sizes to solve the same problem.

Let's take vector add as an example. Suppose my vector is of length 100000. I might choose to do this by launching 100 blocks of 1000 threads each. In this case, each block will have 1000 active threads and 24 inactive threads, because the hardware rounds each block up to 32 whole warps (1024 thread slots). My average utilization of thread resources is 1000/1024 ≈ 97.7%.

Now, what if I chose blocks of size 1024? Now I only need to launch 98 blocks. The first 97 of these blocks are fully utilized in terms of thread utilization: every thread is doing something useful. The 98th block only has 672 (out of 1024) threads that are doing something useful. The others are explicitly inactive because of a thread check ( if (idx < N) ) or other construct in the kernel code. So I have 352 inactive threads in that one block. But my overall average utilization is 100000/100352 ≈ 99.6%.
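The arithmetic for the two configurations above can be checked with a small host-side sketch. This is plain C++ and needs no GPU; the names `LaunchStats`, `launch_stats`, and `print_stats` are hypothetical helpers invented for illustration, and the model simply assumes a warp size of 32 and counts every padded warp lane as an occupied thread slot.

```cpp
#include <cassert>
#include <cstdio>

constexpr int kWarpSize = 32;  // assumed warp size (32 on current NVIDIA GPUs)

struct LaunchStats {
    long blocks;         // blocks needed to cover n elements
    long thread_slots;   // thread slots occupied, including warp padding
    double utilization;  // useful threads / occupied thread slots
};

// Hypothetical helper: estimates thread utilization for a 1-D launch
// that assigns one element to each thread.
LaunchStats launch_stats(long n, int block_size) {
    assert(n > 0 && block_size > 0);
    // Ceiling division: enough blocks to cover all n elements.
    long blocks = (n + block_size - 1) / block_size;
    // Each block is rounded up to a whole number of warps.
    long warps_per_block = (block_size + kWarpSize - 1) / kWarpSize;
    long thread_slots = blocks * warps_per_block * kWarpSize;
    return {blocks, thread_slots, static_cast<double>(n) / thread_slots};
}

// Prints one summary line for a given configuration.
void print_stats(long n, int block_size) {
    LaunchStats s = launch_stats(n, block_size);
    std::printf("block=%d blocks=%ld slots=%ld utilization=%.1f%%\n",
                block_size, s.blocks, s.thread_slots, 100.0 * s.utilization);
}
```

Calling `print_stats(100000, 1000)` and `print_stats(100000, 1024)` should report 100 blocks over 102400 slots and 98 blocks over 100352 slots respectively, matching the utilization figures in the text.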

So in the above scenario, it's better to choose "full" threadblocks, evenly divisible by 32.

If you are doing vector add on a vector of length 1000, and you intend to do it in a single threadblock (both may be bad ideas), then it does not matter whether you choose 1000 or 1024 for your threadblock size.
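To see why the two sizes are equivalent in the single-block case, here is a minimal host-side sketch, again assuming a warp size of 32. The helper names `warps_per_block` and `active_lanes_in_last_warp` are made up for this illustration.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp size

// Number of warps a block of this size occupies (rounded up).
int warps_per_block(int block_size) {
    assert(block_size > 0);
    return (block_size + kWarpSize - 1) / kWarpSize;
}

// Number of launched threads that fall into the block's final warp.
int active_lanes_in_last_warp(int block_size) {
    assert(block_size > 0);
    int remainder = block_size % kWarpSize;
    return remainder == 0 ? kWarpSize : remainder;
}
```

Both `warps_per_block(1000)` and `warps_per_block(1024)` come out to 32 warps. With 1000 threads the last warp has 8 launched lanes and 24 idle ones; with 1024 threads and a 1000-element vector, those same 24 lanes are launched but masked off by the `if (idx < N)` check. Either way the block occupies the same 1024 thread slots.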