
CUDA - Blocks and Threads

I have implemented a string matching algorithm on the GPU. The search time of the parallel version has decreased considerably compared with the sequential version of the algorithm, but using different numbers of blocks and threads gives me different results. How can I determine the number of blocks and threads to get the best results?

I think this question is hard, if not impossible, to answer, for the reason that it really depends on the algorithm and how it is operating. Since I can't see your implementation, I can give you some leads:

  1. Don't use global memory; check how you can max out the use of shared memory. Generally, get a good feel for how threads access memory and how data is retrieved. (See the sketch after this list.)

  2. Understand how your warps operate. Sometimes threads in a warp may wait for other threads to finish if you have a 1-to-1 mapping between threads and data. Instead of this 1-to-1 mapping, you can map each thread to multiple data items so that the threads are kept busy.

  3. Since blocks consist of threads grouped into warps of 32, it is best if the number of threads in a block is a multiple of 32, so that you don't get warps consisting of 3 threads, etc.

  4. Avoid diverging paths within warps.
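
To make these points concrete, here is a minimal sketch of a string-matching kernel (all names are illustrative; this is not the asker's code): the pattern is staged into shared memory once per block (point 1), a grid-stride loop maps each thread to multiple candidate positions (point 2), and the suggested launch uses a multiple-of-32 block size (point 3).

#include <cuda_runtime.h>

// Hypothetical kernel: count occurrences of pattern in text.
__global__ void countMatches(const char* text, int textLen,
                             const char* pattern, int patLen,
                             unsigned int* count)
{
    extern __shared__ char pat[];               // pattern cached in shared memory
    for (int i = threadIdx.x; i < patLen; i += blockDim.x)
        pat[i] = pattern[i];                    // staged once per block (point 1)
    __syncthreads();

    // Grid-stride loop: each thread tests several candidate positions,
    // so all threads stay busy even when textLen >> total thread count (point 2).
    for (int pos = blockIdx.x * blockDim.x + threadIdx.x;
         pos <= textLen - patLen;
         pos += gridDim.x * blockDim.x)
    {
        int j = 0;
        while (j < patLen && text[pos + j] == pat[j]) ++j;
        if (j == patLen) atomicAdd(count, 1u);  // matches are rare, so this is cheap
    }
}

// Possible launch, block size a multiple of 32 (point 3):
// countMatches<<<256, 128, patLen>>>(dText, textLen, dPat, patLen, dCount);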

I hope it helps a bit.

@Chris's points are very important too, but depend more on the algorithm itself.

  1. Check the CUDA manual about thread alignment for memory lookups. Shared memory arrays should also be sized as a multiple of 16.

  2. Use coalesced global memory reads. By algorithm design this is often the case anyway, and using shared memory helps.

  3. Don't use atomic operations in global memory, or at all if possible; they are very slow. Some algorithms using atomic operations can be rewritten using different techniques (see the sketch below).
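
As a hedged sketch of point 3 (names invented for the example): a per-thread global atomicAdd can often be replaced by a shared-memory reduction per block followed by a single atomic per block. The first read is also coalesced, since consecutive threads touch consecutive addresses (point 2).

#include <cuda_runtime.h>

// Hypothetical sum kernel: one global atomic per block instead of one per thread.
// (Float atomics need compute capability 2.0+; use integers on older GPUs.)
__global__ void sumReduce(const float* in, int n, float* out)
{
    extern __shared__ float buf[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // coalesced global read
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(out, buf[0]);                 // single atomic per block
}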

Without seeing the code, no one can tell you what is best or why performance changes.

The number of threads per block of your kernel is the most important value.

Important values for calculating it are:

  • Maximum number of resident threads per multiprocessor
  • Maximum number of resident blocks per multiprocessor
  • Maximum number of threads per block
  • Number of 32-bit registers per multiprocessor

Your algorithm should scale across all GPUs, reaching 100% occupancy. For this I created a helper class which automatically detects the best thread count for the GPU in use and passes it to the kernel as a DEFINE (a rough sketch of the detection follows the example below).

/**
 * Number of Threads in a Block
 *
 * Maximum number of resident blocks per multiprocessor : 8
 *
 * ///////////////////
 * Compute capability:
 * ///////////////////
 *
 * Cuda [1.0 - 1.1] =   
 *  Maximum number of resident threads per multiprocessor 768
 *  Optimal Usage: 768 / 8 = 96
 * Cuda [1.2 - 1.3] =
 *  Maximum number of resident threads per multiprocessor 1024
 *  Optimal Usage: 1024 / 8 = 128
 * Cuda [2.x] =
 *  Maximum number of resident threads per multiprocessor 1536
 *  Optimal Usage: 1536 / 8 = 192
 */ 
public static int BLOCK_SIZE_DEF = 96;

Example: Cuda 1.1, reaching 768 resident threads per SM

  • 8 Blocks * 96 Threads per Block = 768 threads
  • 3 Blocks * 256 Threads per Block = 768 threads
  • 1 Block * 512 Threads per Block = 512 threads <- 33% of the GPU will be idle
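
A rough sketch of the detection such a helper class can perform (hedged: the fixed resident-block limit of 8 only matches these older compute capabilities, and all names are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Derive a block size that lets the maximum number of resident blocks
// fill the SM, following the table above.
int detectBlockSize(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int residentBlocksPerSM = 8;  // assumption: limit for CC 1.x / 2.x
    return prop.maxThreadsPerMultiProcessor / residentBlocksPerSM;
}

int main()
{
    printf("BLOCK_SIZE_DEF = %d\n", detectBlockSize(0));
    return 0;
}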

This is also mentioned in the book:

Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)

Good programming advice:

  1. Analyse your kernel code and write down the maximal number of threads it can handle or how many "units" it can process.
  2. Also output your register usage and try to lower it to fit the targeted CUDA version, because if you use too many registers in your kernel, fewer blocks will be executed, resulting in lower occupancy and performance. (See the sketch after this list.)
     Example: Using Cuda 1.1 with the optimal 768 resident threads per SM, you have 8192 registers available. This leads to 8192 / 768 = 10 maximum registers per thread/kernel. If you use 11, the GPU will run one block less, resulting in decreased performance.
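
As a hedged sketch of point 2 (kernel name and body invented): register pressure can be limited with the -maxrregcount compiler flag or in-source with __launch_bounds__, and nvcc --ptxas-options=-v prints the "Used N registers" lines shown in the example below.

#include <cuda_runtime.h>

// Ask the compiler to fit 96 threads per block with 8 resident blocks per SM,
// which caps the number of registers it may spend per thread.
__global__ void __launch_bounds__(96, 8)
scaleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;  // trivial body, purely illustrative
}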

Example: a row-vector normalizing kernel of mine that works independently of the matrix size (a rough sketch follows the stats below).

/*
 * ////////////////////////
 * // Compute capability //
 * ////////////////////////
 *
 * Used 12 registers, 540+16 bytes smem, 36 bytes cmem[1]
 * Used 10 registers, 540+16 bytes smem, 36 bytes cmem[1] <-- with -maxrregcount 10, the limit for Cuda 1.1
 * I:   Maximum number of Rows = max(x-dim)^max(dimGrid)
 * II:  Maximum number of Columns = unlimited, since they are loaded in a tile loop
 *
 * Cuda [1.0 - 1.3]:
 * I:   65535^2 = 4,294,836,225
 *
 * Cuda [2.0]:
 * I:   65535^3 = 281,462,092,005,375
 */
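
A hedged sketch of what such a kernel can look like (illustrative names, not the original code): one block per row, so the grid dimensions bound the row count as in note I, while the columns are walked in a strided loop and are therefore unlimited, as in note II.

#include <cuda_runtime.h>

__global__ void normalizeRows(float* m, int cols)
{
    extern __shared__ float partial[];          // one slot per thread
    float* row = m + (size_t)blockIdx.x * cols; // one block per row (note I)

    float sum = 0.0f;                           // per-thread partial sum of squares
    for (int c = threadIdx.x; c < cols; c += blockDim.x)  // tile loop (note II)
        sum += row[c] * row[c];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction (blockDim.x assumed a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    float inv = rsqrtf(partial[0]);             // 1 / Euclidean norm of the row
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        row[c] *= inv;
}

// Possible launch: normalizeRows<<<rows, 128, 128 * sizeof(float)>>>(dM, cols);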
