
CUDA: Best number of pixels computed per thread (grayscale)

I'm working on a program that converts an image to grayscale, using the CImg library. For each pixel I have to read the three RGB values, compute the corresponding gray value, and store the gray pixel in the output image. I'm working with an NVIDIA GTX 480. Some details about the card:

  • Microarchitecture: Fermi
  • Compute capability (version): 2.0
  • Cores per SM (warp size): 32
  • Streaming Multiprocessors: 15
  • Maximum number of resident warps per multiprocessor: 48
  • Maximum amount of shared memory per multiprocessor: 48KB
  • Maximum number of resident threads per multiprocessor: 1536
  • Number of 32-bit registers per multiprocessor: 32K

I'm using a square grid with blocks of 256 threads. The program can take input images of different sizes (e.g. 512x512 px, 10000x10000 px). I've observed that increasing the number of pixels assigned to each thread improves performance, so it's better than computing one pixel per thread. The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number? I know that on the GTX 480, 1536 is the maximum number of resident threads per multiprocessor. Do I have to take this number into account? The following is the code executed by the kernel.

// Grid-stride loop: each thread processes multiple pixels, strided by the total
// number of threads in the grid.
for (int i = (blockIdx.x * blockDim.x) + threadIdx.x; i < width * height; i += gridDim.x * blockDim.x) {
    // Planar RGB layout: the R, G and B planes are stored back to back.
    float r = static_cast< float >(inputImage[i]);
    float g = static_cast< float >(inputImage[(width * height) + i]);
    float b = static_cast< float >(inputImage[(2 * width * height) + i]);

    // Weighted luminance, then scaled and offset before the 8-bit store.
    float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[i] = static_cast< unsigned char >(grayPix);
}
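For reference, here is a minimal host-side sketch (not from the original post) of how a statically chosen pixels-per-thread value translates into the launch configuration for the kernel above, using a 1-D grid for simplicity; the names pixelsPerThread, d_in, d_out and grayscaleKernel are hypothetical:

// Hypothetical host-side launch sketch; assumes a planar RGB unsigned char image
// already on the device and the grid-stride kernel shown above (grayscaleKernel).
const int threadsPerBlock = 256;      // block size mentioned in the question
int totalPixels     = width * height;
int pixelsPerThread = 4;              // the value the question asks how to choose

// The more pixels each thread handles, the fewer blocks are launched;
// the grid-stride loop inside the kernel covers whatever remains.
int blocks = (totalPixels + threadsPerBlock * pixelsPerThread - 1)
           / (threadsPerBlock * pixelsPerThread);

grayscaleKernel<<<blocks, threadsPerBlock>>>(d_in, d_out, width, height);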

The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number?

You've mentioned an observed characteristic:

I observed that increasing the number of pixels assigned to each thread improves performance,

This is actually a fairly common observation for these types of workloads, and it may also be the case that this is more evident on Fermi than on newer architectures. A similar observation occurs during matrix transpose. If you write a "naive" matrix transpose that transposes one element per thread, and compare it with the matrix transpose discussed here that transposes multiple elements per thread, you will discover, especially on Fermi, that the multiple-element-per-thread transpose can achieve approximately the available memory bandwidth on the device, whereas the one-element-per-thread transpose cannot. This ultimately has to do with the ability of the machine to hide latency, and the ability of your code to expose enough work to allow the machine to hide latency. Understanding the underlying behavior is somewhat involved, but fortunately, the optimization objective is fairly simple.

GPUs hide latency by having lots of available work to switch to when they are waiting on previously issued operations to complete. So if I have a lot of memory traffic, the individual requests to memory have a long latency associated with them. If I have other work that the machine can do while it is waiting for the memory traffic to return data (even if that work generates more memory traffic), then the machine can use that work to keep itself busy and hide latency.

The way to give the machine lots of work starts by making sure that we have enabled the maximum number of warps that can fit within the machine's instantaneous capacity. This number is fairly simple to compute: it is the product of the number of SMs on your GPU and the maximum number of warps that can be resident on each SM. We want to launch a kernel that meets or exceeds this number, but additional warps/blocks beyond this number don't necessarily help us hide latency.
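As a rough sketch (not part of the original answer), this target can be queried at runtime from the device properties rather than hard-coded from the GTX 480 specification:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Maximum resident warps on one SM = max resident threads per SM / warp size.
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    // Target: enough warps to fill every SM to its resident limit.
    int targetWarps   = prop.multiProcessorCount * maxWarpsPerSM;
    int targetThreads = targetWarps * prop.warpSize;

    // On a GTX 480: 15 SMs * 48 warps/SM = 720 warps = 23040 threads.
    printf("SMs: %d, warps/SM: %d, target warps: %d, target threads: %d\n",
           prop.multiProcessorCount, maxWarpsPerSM, targetWarps, targetThreads);
    return 0;
}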

Once we have met the above number, we want to pack as much "work" as possible into each thread. Effectively, for the problem you describe and for the matrix transpose case, packing as much work as possible into each thread means handling multiple elements per thread.

So the steps are fairly simple:

  1. Launch as many warps as the machine can handle instantaneously.
  2. Put all remaining work in the thread code, if possible.

Let's take a simplistic example. Suppose my GPU has 2 SMs, each of which can handle 4 warps (128 threads). Note that this is not the number of cores, but the "Maximum number of resident warps per multiprocessor" as indicated by the deviceQuery output.

My objective then is to create a grid of 8 warps, i.e. 256 threads total (in at least 2 threadblocks, so they can be distributed across the 2 SMs), and make those warps perform the entire problem by handling multiple elements per thread. So if my overall problem space is a total of 1024x1024 elements, I would ideally want to handle 1024*1024/256 = 4096 elements per thread.
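A sketch of what that sizing could look like for the grayscale kernel follows (again illustrative: grayscaleKernel, d_in and d_out are hypothetical names, and the grid-stride loop from the question is assumed inside the kernel):

void launchGrayscale(const unsigned char *d_in, unsigned char *d_out,
                     int width, int height, const cudaDeviceProp &prop)
{
    const int threadsPerBlock = 256;   // 8 warps per block

    // Enough blocks to reach the machine's instantaneous capacity:
    // SMs * max resident threads per SM, divided by the block size.
    int blocks = (prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor)
               / threadsPerBlock;

    // For small images, don't launch more blocks than there are pixels to cover.
    int neededBlocks = (width * height + threadsPerBlock - 1) / threadsPerBlock;
    if (neededBlocks < blocks) blocks = neededBlocks;

    // The grid-stride loop inside the kernel handles the remaining pixels, so each
    // thread processes roughly width*height / (blocks * threadsPerBlock) of them.
    grayscaleKernel<<<blocks, threadsPerBlock>>>(d_in, d_out, width, height);
}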

Note that this method gives us an optimization direction. We do not necessarily have to achieve this objective completely in order to saturate the machine. It might be the case that it is only necessary, for example, to handle 8 elements per thread in order to allow the machine to fully hide latency, and usually another limiting factor will appear, as discussed below.

Following this method will tend to remove latency as a limiting factor for the performance of your kernel. Using the profiler, you can assess the extent to which latency is a limiting factor in a number of ways, but a fairly simple one is to capture the sm_efficiency metric, and perhaps compare that metric in the two cases you have outlined (one element per thread vs. multiple elements per thread). I suspect you will find, for your code, that the sm_efficiency metric indicates a higher efficiency in the multiple-elements-per-thread case, and this indicates that latency is less of a limiting factor in that case.

Once you remove latency as a limiting factor, you will tend to run into one of the other two machine limiting factors for performance: compute throughput and memory throughput (bandwidth). In the matrix transpose case, once we have sufficiently dealt with the latency issue, the kernel tends to run at a speed limited by memory bandwidth.
