
Parallel sum reduction implementation in OpenCL

I am going through the NVIDIA sample code provided at the link.

In the sample kernel code (file oclReduction_kernel.c), reduce4 uses the technique of:

1) unrolling the loop and removing the synchronization barrier for thread id < 32.

2) Apart from this, the code uses blockSize checks to sum the data in local memory. In OpenCL we have get_local_size(0/1) to get the work-group size, so blockSize is confusing me.

I am not able to understand either of the points mentioned above. Why and how do these things help with optimization? Any explanation of reduce5 and reduce6 would be helpful as well.

This is pretty much explained in slides 21 and 22 of https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf, which @Marco13 linked in the comments.

  • As reduction proceeds, # “active” threads decreases
    • When s <= 32, we have only one warp left
  • Instructions are SIMD synchronous within a warp.
  • That means when s <= 32:
    • We don't need to __syncthreads()
    • We don't need “if (tid < s)” because it doesn't save any work

Without unrolling, all warps execute every iteration of the for loop and if statement

And by https://www.pgroup.com/lit/articles/insider/v2n1a5.htm:

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp.

Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
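To make the first point concrete, here is a minimal sketch of what the unrolled last-warp step can look like in OpenCL, assuming a __local scratch buffer sdata, tid = get_local_id(0), and a work-group size of at least 64 (the names are illustrative, not quoted from oclReduction_kernel.c). Because the 32 work-items of one NVIDIA warp execute in lock-step, no barrier(CLK_LOCAL_MEM_FENCE) is needed between the steps, and the if (tid < s) test is dropped because the whole warp runs the instruction anyway:

    if (tid < 32)
    {
        // volatile keeps the compiler from caching sdata values in registers
        // between the implicitly warp-synchronous steps
        volatile __local float *smem = sdata;
        smem[tid] += smem[tid + 32];
        smem[tid] += smem[tid + 16];
        smem[tid] += smem[tid +  8];
        smem[tid] += smem[tid +  4];
        smem[tid] += smem[tid +  2];
        smem[tid] += smem[tid +  1];
    }

Note that this relies on the warp size being 32, which is an NVIDIA-specific assumption; on other OpenCL devices the barriers would still be required.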

Re 2): blockSize there looks to be the size of the work group.
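As for how blockSize can differ from get_local_size(0): one common way (a sketch, not necessarily how the sample project builds its kernels) is to pass the work-group size as a preprocessor define in the clBuildProgram options, e.g. "-D blockSize=256". The checks against blockSize then fold to compile-time constants and the dead branches are removed by the compiler, which a runtime value like get_local_size(0) would not allow:

    // Host side (hypothetical): bake the work-group size into the kernel at build time
    clBuildProgram(program, 1, &device, "-D blockSize=256", NULL, NULL);

    // Kernel side: with blockSize known at compile time, each outer 'if' is resolved
    // by the compiler and only the branches that apply to this work-group size remain
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } barrier(CLK_LOCAL_MEM_FENCE); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } barrier(CLK_LOCAL_MEM_FENCE); }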
