
How are CUDA threads executed inside a single block?

I have several questions regarding CUDA. The following figure is taken from a book on parallel programming. It shows how threads are allocated on the device for a multiplication of two vectors, each of length 8192.

[figure: allocation of thread blocks and SIMD threads for multiplying two 8192-element vectors]

1) In thread block 0 there are 15 SIMD threads. Are these 15 threads executed in parallel, or is only one thread executed at a given time?

2) Each block contains 512 elements in this example. Does this number depend on the hardware, or is it a decision of the programmer?

1) In this particular example, each thread seems to be assigned to 32 elements of the vector. Code that is executed by a single thread is executed sequentially.

2) The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks, given the hardware the code is executed on. For more information, see this elaborate answer: Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)

From your illustration, it seems that:

  • The grid is composed of 16 thread blocks, numbered from 0 to 15.
  • Each block is composed of 16 "SIMD threads", numbered from 0 to 15.
  • Each "SIMD thread" computes the product of 32 vector elements.
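Under the assumption that the figure maps one thread to one vector element (16 blocks × 512 threads = 8192 elements), the layout could be sketched as a minimal CUDA kernel. The kernel and variable names (`mulKernel`, `a`, `b`, `c`) are illustrative, not taken from the book:

```cuda
#include <cuda_runtime.h>

// Elementwise product of two vectors: each thread handles one element.
__global__ void mulKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // guard against overrun
        c[i] = a[i] * b[i];
}

// Host-side launch for the layout in the figure:
// 8192 elements / 512 threads per block = 16 blocks.
// mulKernel<<<16, 512>>>(d_a, d_b, d_c, 8192);
```

Each block of 512 threads is then split by the hardware into 16 warps of 32 threads, which matches the 16 "SIMD threads" per block in the figure.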

It is not necessarily obvious from the illustration whether "SIMD thread" means, in CUDA (OpenCL) parlance:

  • A warp (wavefront) of 32 threads (work-items)

or:

  • A thread (work-item) working on 32 elements

I will assume the former ("SIMD thread" = warp/wavefront), since it is the more reasonable assumption performance-wise; the latter isn't technically incorrect, it is simply a suboptimal design (on current hardware, at least).


1) In thread block 0 there are 15 SIMD threads. Are these 15 threads executed in parallel, or is only one thread executed at a given time?

As stated above, there are 16 warps (numbered from 0 to 15, which makes 16) in thread block 0, each of them made of 32 threads. The 32 threads within a warp execute in lockstep, simultaneously, in parallel. The warps are executed independently of each other, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
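As an illustrative fragment (not from the original answer), a thread inside a kernel can compute which warp it belongs to from its thread index; with 512 threads per block this yields exactly the 16 warps described above:

```cuda
// Inside a kernel body. warpSize is a built-in CUDA variable,
// equal to 32 on all current NVIDIA GPUs.
int lane   = threadIdx.x % warpSize;  // position within the warp: 0..31
int warpId = threadIdx.x / warpSize;  // warp number within the block: 0..15 for 512 threads
```

All 32 threads sharing the same `warpId` are the ones issued together in lockstep.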

2) Each block contains 512 elements in this example. Does this number depend on the hardware, or is it a decision of the programmer?

In this case, it is simply a decision of the programmer, but in some cases hardware limitations can force the programmer to change the design. For example, there is a maximum number of threads a block can hold, and a maximum number of blocks a grid can hold.
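These limits can be queried at runtime with the standard CUDA runtime API call `cudaGetDeviceProperties`; a small host-side sketch (requires an NVIDIA GPU and the CUDA toolkit to run):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // Hardware limits relevant to choosing block and grid sizes:
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size (x):     %d\n", prop.maxGridSize[0]);
    printf("warp size:             %d\n", prop.warpSize);
    return 0;
}
```

On most current devices `maxThreadsPerBlock` is 1024, so a block of 512 threads, as in this example, is well within the limit.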
