CUDA shared memory - sum reduction from kernel

I am working on big datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads. The 6 intermediate results are stored in shared memory by each thread (6 float arrays). However, now I need to add up the intermediate results to produce sums (6 sum values). I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library in the host code.

Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?

What will be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.

This seems unlikely:

I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.

I can't imagine how you have enough space to store your data in shared memory but not in global memory.

Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
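For example, a minimal sketch using CUB's BlockReduce (assuming a 1024-thread block and one value per thread; you would invoke it once for each of the 6 intermediate results):

    #include <cub/cub.cuh>

    // Sketch: block-wide sum with cub::BlockReduce.
    // Assumes a 1024-thread block and one float per thread.
    __global__ void block_sum(const float *in, float *out)
    {
        typedef cub::BlockReduce<float, 1024> BlockReduce;
        __shared__ typename BlockReduce::TempStorage temp_storage;

        float v = in[blockIdx.x * blockDim.x + threadIdx.x];
        float sum = BlockReduce(temp_storage).Sum(v);  // result is valid in thread 0 only

        if (threadIdx.x == 0)
            out[blockIdx.x] = sum;
        __syncthreads();  // required before reusing temp_storage for another reduction
    }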

Or you can write your own sum-reduction code. It's not terribly hard to do; there are many questions on SO about it, such as this one.
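For reference, a classic shared-memory tree reduction looks roughly like this (a sketch assuming a power-of-two block size; you would repeat it for each of the 6 shared arrays):

    // Sketch of a hand-rolled block sum over one shared-memory array.
    // Assumes blockDim.x is a power of two (1024 here) and that `my_value`
    // is one of the 6 per-thread intermediate results (hypothetical name).
    __shared__ float sdata[1024];
    sdata[threadIdx.x] = my_value;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    // sdata[0] now holds the block-wide sum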

Or you could adapt the CUDA sample code.

Update

After seeing all the comments, I understand that instead of doing one reduction (or a few), you need to do 450x450x6 of them.

In this case there's a simpler solution.

You don't need to implement a relatively complex parallel reduction for each 1500-D vector. Since you already have 450x450x6 vectors to reduce, you can reduce all these vectors in parallel using the traditional serial reduction method.

You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.

In each thread, you could iterate over the 1500 frames. In each iteration, you could first compute the 6 intermediate results, then add them to the sums. When you finish all the iterations, you could write the 6 sums to global mem.

That finishes the kernel design. And no shared mem is needed.
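A minimal sketch of that design (assuming the cube is stored frame-major and that compute_six() stands in for your per-element computation) could look like this:

    // Sketch: one thread per pixel, serial loop over the 1500 frames.
    // Assumptions: cube is laid out as cube[frame][y][x]; compute_six()
    // is a placeholder for the per-element computation producing 6 floats.
    __global__ void per_pixel_sums(const float *cube, float *sums,
                                   int width, int height, int frames)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float acc[6] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f};

        for (int f = 0; f < frames; ++f) {
            float v = cube[((size_t)f * height + y) * width + x];
            float r[6];
            compute_six(v, r);               // hypothetical device function
            for (int i = 0; i < 6; ++i)
                acc[i] += r[i];
        }

        size_t pix = (size_t)y * width + x;
        for (int i = 0; i < 6; ++i)
            sums[pix * 6 + i] = acc[i];      // 6 sums per pixel, written once
    }

    // Launch matching the 16x16 block / 29x29 grid mentioned above:
    // dim3 block(16, 16), grid(29, 29);     // 29*16 = 464 >= 450; the bounds check handles the edge
    // per_pixel_sums<<<grid, block>>>(d_cube, d_sums, 450, 450, 1500);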

You will find that the performance is very good. Since it is a memory-bound operation, it won't take much longer than simply accessing all the image cube data once.

In case you don't have enough global mem for the whole cube, you could split it into 4 sub-cubes of [1500][225][225], and call the kernel routine on each sub-cube. The only thing you need to change is the grid size.

Take a look at this, which explains CUDA parallel reduction thoroughly.

If I understand it correctly, each thread should sum up "only" 6 floats.

I'm not sure it is worth doing that with a parallel reduction in general, in the sense that you would see performance gains.

If you are targeting Kepler, you may try to use shuffle operations, provided you set the block size properly so that your intermediate results fit the Streaming Multiprocessor's registers in some way.
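A warp-level sum via shuffle might look like the following sketch (on CUDA 9+ the _sync variants are required; the original Kepler-era intrinsic was __shfl_down without the mask argument):

    // Sketch: warp-wide sum using shuffle, no shared memory needed.
    __inline__ __device__ float warp_sum(float val)
    {
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;  // lane 0 ends up with the sum of all 32 lanes
    }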

As also pointed out by Robert Crovella, your statement about the possibility of storing the intermediate results seems strange as the amount of global memory is certainly larger than the amount of shared memory.
