
CUDA computing a histogram with shared memory

I'm following a Udacity problem-set lesson to compute a histogram of numBins elements from a long series of numElems values. In this simple case each element's value is also its own bin in the histogram, so generating the histogram with CPU code is as simple as

for (i = 0; i < numElems; ++i)
  histo[val[i]]++;

I don't follow the video's explanation of a "fast histogram computation", according to which I should sort the values by a 'coarse bin id' and then compute the final histogram.

The question is:

  • Why should I sort the values by 'coarse bin indices'?

Why should I sort the values by 'coarse bin indices'?

This is an attempt to break down the work into pieces that can be handled by a single threadblock. There are several considerations here:

  1. On a GPU, it's desirable to have multiple threadblocks so that all SMs can be engaged in solving the problem.
  2. A given threadblock lives and operates on a single SM, so it is confined to the resources available on that SM, the primary limits being the number of threads and the size of available shared memory.
  3. Since shared memory especially is limited, the division of work creates a smaller-sized histogram operation for each threadblock, which may fit in the SM's shared memory whereas the overall histogram range may not. For example, if I am histogramming over a range of 4 decimal digits, that would be 10,000 bins total. Each bin would probably need an int value, so that is 40 Kbytes, which would just barely fit into shared memory (and might have negative performance implications as an occupancy limiter). A histogram over 5 decimal digits probably would not fit. On the other hand, with a "coarse bin sort" on a single decimal digit, I could reduce the per-block shared memory requirement from 40 Kbytes to 4 Kbytes (approximately).

Shared memory atomics are often considerably faster than global memory atomics, so breaking down the work this way allows for efficient use of shared memory atomics, which may be a useful optimization.
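A minimal sketch of that pattern (not the course's reference solution; NUM_BINS and the `% NUM_BINS` binning are placeholder assumptions): each block accumulates into a private shared-memory histogram with cheap shared-memory atomics, then flushes it with one global atomic per bin.

```cuda
#define NUM_BINS 1000  // assumed to fit comfortably in shared memory

__global__ void histo_kernel(const unsigned int *vals,
                             unsigned int *histo, int numElems)
{
    __shared__ unsigned int smem[NUM_BINS];

    // zero the block-private histogram
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        smem[b] = 0;
    __syncthreads();

    // grid-stride loop: each thread counts into shared memory
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numElems;
         i += blockDim.x * gridDim.x)
        atomicAdd(&smem[vals[i] % NUM_BINS], 1u);
    __syncthreads();

    // flush: one global atomicAdd per bin per block
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&histo[b], smem[b]);
}
```

Most of the atomic traffic lands on the fast shared-memory counters; the global histogram sees only one atomicAdd per bin per block rather than one per input element.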

So I will have to sort all the values first? Isn't that more expensive than reading and doing an atomicAdd into the right bin?

Maybe. But the idea of a coarse bin sort is that it may be computationally much less expensive than a full sort. A radix sort is a commonly used, relatively fast sorting operation that can be done in parallel on a GPU. Radix sort has the characteristic that the sorting operation begins with the most significant "digit" and proceeds iteratively to the least significant digit. However, a coarse bin sort implies that only some subset of the most significant digits need actually be "sorted". Therefore, a "coarse bin sort" using a radix sort technique could be computationally substantially less expensive than a full sort. If you sort only on the most significant digit out of 3 digits, as indicated in the Udacity example, your sort is only approximately 1/3 as expensive as a full sort.

I'm not suggesting that this is a guaranteed recipe for faster performance in every case. The specifics matter (e.g. size of histogram, range, final number of bins, etc.). The specific GPU you use may affect the tradeoff as well. For example, Kepler and newer devices have substantially improved global memory atomics, so the comparison will be substantially impacted by that. (OTOH, Pascal has substantially improved shared memory atomics, which will once again affect the comparison in the other direction.)
