CUDA: optimizing the number of blocks for a grid-stride loop

Question

I have started implementing a simple 1D array calculation using CUDA. Following the documentation, I first tried to define an optimal number of blocks and block size:

...
int N_array = 1000000;
...
int n_threads = 256;

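// Integer ceiling division: N_array / n_threads on its own truncates before ceil() can round up.
int n_blocks = (N_array + n_threads - 1) / n_threads;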
dim3 grid(n_blocks, 1, 1);
dim3 block(n_threads, 1, 1);
...

For the kernel, I have used a grid-stride approach, as suggested in the NVIDIA blog:

...
int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
int stride = gridDim.x * blockDim.x;
int threadsInBlock = blockDim.x;

for (unsigned long long n = global_idx; n < N_array; n += stride) {
    ...

My questions are:

1. Is it fine to define the number of blocks as above? Or should it be defined so that the total number of requested threads is smaller than the number of available CUDA cores? (My thinking is that blocks sized this way would take advantage of the grid-stride loop by each doing more calculations.)
2. Since for this large array the number of requested threads is larger than the number of CUDA cores, is there any penalty for having many blocks inactive, compared to requesting fewer blocks and keeping most of them active? (This is related to 1.)

Answer 1

Conventional wisdom is that the number of threads in the grid for a grid-stride loop should be sized to roughly match the thread-carrying capacity of the GPU in question. The reason for this is to maximize the exposed parallelism, which is one of the two most important objectives for any CUDA programmer. This gives the machine the maximum opportunity to do latency hiding. This is not the same as the number of CUDA cores. Divorce yourself from thinking about the number of CUDA cores in your GPU for these types of design questions. The number of CUDA cores is not relevant to this inquiry.

The thread-carrying capacity of the GPU, ignoring occupancy limiters, is the number of SMs in the GPU times the maximum number of threads per SM.

Both of these quantities can be retrieved programmatically, and the deviceQuery sample code demonstrates how.
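As a rough illustration of the calculation above, the sketch below reads the SM count and the maximum threads per SM from `cudaGetDeviceProperties` and sizes a grid-stride launch from them. The kernel `scale`, the array size, and the choice of 256 threads per block are placeholders for illustration, and capping the grid at the number of blocks needed to cover the array is an added assumption rather than part of the answer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place
// using a grid-stride loop.
__global__ void scale(float *data, size_t n)
{
    size_t idx    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride)
        data[i] *= 2.0f;
}

int main()
{
    const size_t n = 1000000;        // array length (placeholder)
    const int threadsPerBlock = 256; // placeholder block size
    const int device = 0;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        fprintf(stderr, "could not query device %d\n", device);
        return 1;
    }

    // Thread-carrying capacity, ignoring occupancy limiters:
    // number of SMs times maximum resident threads per SM.
    int capacity = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    int blocks   = capacity / threadsPerBlock;

    // Assumption: do not launch more blocks than the array needs.
    int needed = (int)((n + threadsPerBlock - 1) / threadsPerBlock);
    if (blocks > needed) blocks = needed;

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    printf("SMs: %d, max threads/SM: %d, blocks launched: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, blocks);

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Because the kernel strides by the total grid size, it produces the same result for any block count; only the amount of work per thread changes.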

If you want to be more precise, you can do an occupancy analysis on your kernel to determine the maximum number of threads that can actually be resident on an SM, then multiply this by the number of SMs. Occupancy analysis can be done statically, using the occupancy calculator spreadsheet provided as part of the CUDA toolkit, or dynamically, using the occupancy API. (You can also inspect or measure occupancy after the fact with the Nsight Compute profiler.) There are many questions already here on the cuda SO tag discussing these things, and it is covered in the programming guide, so I'll not provide an occupancy tutorial here. The number you arrive at via occupancy analysis is upper-bounded by the number of SMs times the maximum threads per SM.
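For the dynamic route, a minimal sketch (not a full occupancy tutorial) might look like the following. It asks `cudaOccupancyMaxActiveBlocksPerMultiprocessor` how many blocks of a given size can be resident per SM for one specific kernel, then multiplies by the SM count. The kernel `scale` and the block size of 256 are hypothetical, and the kernel is assumed to use no dynamic shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration (same grid-stride pattern as the question).
__global__ void scale(float *data, size_t n)
{
    size_t idx    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride)
        data[i] *= 2.0f;
}

int main()
{
    const int device = 0;
    const int threadsPerBlock = 256; // placeholder block size

    int numSMs = 0;
    if (cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device)
            != cudaSuccess) {
        fprintf(stderr, "could not query SM count\n");
        return 1;
    }

    // Blocks of this size that can be resident on one SM for this kernel,
    // taking its register and shared-memory usage into account.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scale, threadsPerBlock, 0 /* dynamic smem bytes */)
            != cudaSuccess) {
        fprintf(stderr, "occupancy query failed\n");
        return 1;
    }

    int blocks = blocksPerSM * numSMs; // grid size that fills the GPU
    printf("blocks per SM: %d, SMs: %d, grid size: %d\n",
           blocksPerSM, numSMs, blocks);
    return 0;
}
```

The resulting grid size can then be used directly in the launch configuration of the grid-stride kernel.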

You will want to choose values for threads per block and number of blocks that allow those maximums to be achieved. For example, on a cc8.6 GPU with a maximum of 1536 threads per SM, you would want to choose perhaps 512 threads per block, and then a number of blocks equal to 3 times the number of SMs in your GPU. You could also choose 256 threads per block and 6 times the number of SMs. Choosing a value of 1024 threads per block, in this particular example, and ignoring occupancy considerations, might not be a good choice.
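The arithmetic behind these numbers can be sketched with two small helpers. This is plain host-side C++ (it compiles unchanged in a .cu file); the function names are made up for illustration, and the calculation deliberately ignores registers, shared memory, and the per-SM block limit.

```cpp
// Whole blocks of the given size that fit within one SM's thread limit.
// Ignores every other occupancy limiter.
int blocksPerSM(int maxThreadsPerSM, int threadsPerBlock)
{
    return maxThreadsPerSM / threadsPerBlock;
}

// Threads that are actually resident on one SM with that block size.
int residentThreadsPerSM(int maxThreadsPerSM, int threadsPerBlock)
{
    return blocksPerSM(maxThreadsPerSM, threadsPerBlock) * threadsPerBlock;
}
```

With a limit of 1536 threads per SM, `blocksPerSM(1536, 512)` is 3 and `blocksPerSM(1536, 256)` is 6, and both fill the SM completely. `blocksPerSM(1536, 1024)` is 1, so only 1024 of the 1536 thread slots are used, which is why 1024 threads per block is a questionable choice on that GPU.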
