
Most efficient number of blocks to launch in CUDA?

  • I have a very large array with N0 elements.
  • Each thread will loop over and operate on m elements.
  • I have a fixed number of threads per block, TBP.
  • CUDA constrains the number of blocks per grid: BPG < 65535 =: BPG_max.

Now, let's downsize and consider an array of N0 = 90 elements with TBP = 32.

  • I could fire off 3 blocks of 32 threads, each looping once (m = 1), which means 3 × 32 × 1 = 96 elements could have been operated on, i.e. a wastage of 6.
  • Or I could fire off 2 blocks of 32 with m = 2, which means 2 × 32 × 2 = 128 elements could have been operated on, a wastage of 38.

With large arrays (100 MB+) and lots of loops (10,000+), the factors get bigger and the wastage can get very large, so how do I minimize it? That is, I'd like a procedure to optimize (where N denotes the work actually launched):

[image: the optimization objective — formula not preserved in the scrape; from context, minimize the wastage N − N0 subject to N ≥ N0]

I would not be worried about "wasted" threads - GPU threads are lightweight.
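One common CUDA idiom that makes this concrete (a grid-stride loop - my sketch, not part of the answer itself) is to launch any convenient grid and let each thread stride over the array with a bounds check, so no element is missed and the only "waste" is a few idle threads in the final pass:

```cuda
// Grid-stride loop: works for any N0 and any launch configuration.
// `scale` is a hypothetical example kernel.
__global__ void scale(float *a, int n, float s)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        a[i] *= s;
}

// Host side: e.g. scale<<<BPG, TBP>>>(d_a, N0, 2.0f);
// The i < n guard makes the exact values of BPG and m irrelevant
// to correctness.
```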

You might actually want to increase the block size, as this could increase the occupancy of your GPU. Note that an SMX (in the GeForce 6xx line) can only execute 16 concurrent blocks. Making blocks larger would allow you to schedule more threads to hide memory access latency.
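For choosing a block size by occupancy, the CUDA runtime later gained a helper for exactly this (available since CUDA 6.5, which may postdate this question - treat as a sketch; `myKernel`, `d_a`, and `N0` are placeholders):

```cuda
// Ask the runtime for an occupancy-maximizing block size, then size the
// grid to cover the array.
int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
int gridSize = (N0 + blockSize - 1) / blockSize;  // ceil(N0 / blockSize)
myKernel<<<gridSize, blockSize>>>(d_a, N0);
```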

That is in fact a quite complicated problem, and I doubt there is an O(1) solution to it. But I'm guessing you can afford some linear time on the CPU to compute that minimum.

This is Wolfram Alpha's opinion.

Depending on what you are doing in your kernels, the answer might or might not be as simple as the optimization problem you cite. E.g. if you are going to have issues with latency, threads waiting for each other to complete, etc., then there are more issues to consider.

This site has some great heuristics. Some general highlights:

Choosing Blocks Per Grid

  • Blocks per grid should be >= the number of multiprocessors.
  • The more you use __syncthreads() in your kernels, the more blocks you want (so that one block can run while another waits to sync).

Choosing Threads Per Block

  • Choose threads in multiples of the warp size (i.e. generally 32).

  • It is generally good to choose the number of threads per block so that the hardware's maximum number of threads per block divides evenly by it. E.g. with a maximum of 768 threads, using 256 threads per block will tend to be better than 512, because multiple blocks can be resident on a multiprocessor simultaneously.

  • Think about whether your threads will share memory and, if so, how many you'll want to have sharing.
