OpenCL 2.x - Sum Reduction function

Following up on a previous post, strategy-for-doing-final-reduction, I would like to know what the latest features of OpenCL 2.x offer (as opposed to 1.x, which was the subject of that post), especially the atomic functions that can be used to reduce an array (in my case, a sum reduction).

I was told that the OpenCL 1.x atomic functions (atom_add) perform poorly, and I was able to confirm it, so I am looking for the fastest way to implement the final reduction (i.e. the sum of the partial sums computed by each work-group).

As a reminder, here is the kind of kernel code I am using at the moment:

__kernel void sumGPU(__global const double *input,
                     __global double *partialSums,
                     __local double *localSums)
{
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Copy from global memory to local memory
  localSums[local_id] = input[get_global_id(0)];

  // Tree reduction inside the work-group
  // (halving the stride assumes group_size is a power of two)
  for (uint stride = group_size / 2; stride > 0; stride /= 2)
  {
    // Wait until the previous round of pairwise additions is complete
    barrier(CLK_LOCAL_MEM_FENCE);

    // Split the active range in two halves and add the element at
    // local_id + stride to the element at local_id
    if (local_id < stride)
      localSums[local_id] += localSums[local_id + stride];
  }

  // Write this work-group's result into partialSums[nWorkGroups]
  if (local_id == 0)
    partialSums[get_group_id(0)] = localSums[0];
}

As you can see, when the kernel finishes, the array partialSums[number_of_workgroups] contains all the partial sums.

Could you please tell me how to perform a second, final reduction of this array with the best possible performance, using the functions available in OpenCL 2.x? A classic solution is to perform this final reduction on the CPU, but ideally I would like to do it directly in kernel code.

A suggested code snippet is welcome.

One last point: I am working on macOS High Sierra 10.13.5 with the following hardware model:

[image: hardware model]

Can OpenCL 2.x be installed on my Mac hardware?

Regards

Atomic functions should be avoided because they hurt performance compared to a parallel reduction kernel. Your kernel looks to be on the right track, but remember that you will have to invoke it multiple times; do not perform the final sum on the host (unless the previous reduction leaves you with only a very small amount of data). That is, you need to keep invoking it until your local size equals your global size. There is no way to handle a large amount of data in a single invocation, because there is no way to synchronize between work-groups.
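To make the "invoke it multiple times" idea concrete, here is a minimal CPU-side sketch in plain C. It is not OpenCL host code: reduce_pass only emulates what one launch of a sumGPU-style kernel computes, and multi_pass_sum repeats it until a single value is left. Both function names, the zero-padding of the last group, and the power-of-two group size are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Emulates ONE kernel launch on the CPU (illustration only).
 * Each "work-group" reduces group_size consecutive elements of in[] with
 * the same halving-stride loop as the kernel and writes one value to out[].
 * Assumes n is a multiple of group_size and group_size is a power of two. */
void reduce_pass(const double *in, double *out, size_t n, size_t group_size)
{
    double *local = malloc(group_size * sizeof *local);
    assert(local != NULL);
    for (size_t g = 0; g < n / group_size; ++g) {
        memcpy(local, in + g * group_size, group_size * sizeof *local);
        for (size_t stride = group_size / 2; stride > 0; stride /= 2)
            for (size_t id = 0; id < stride; ++id)
                local[id] += local[id + stride];
        out[g] = local[0];
    }
    free(local);
}

/* Repeats reduce_pass until one value remains; the last pass is the one
 * where a single group covers all remaining data.
 * *passes receives the number of emulated launches. */
double multi_pass_sum(const double *input, size_t n, size_t group_size,
                      int *passes)
{
    assert(group_size >= 2 && (group_size & (group_size - 1)) == 0);
    *passes = 0;
    if (n == 0)
        return 0.0;

    /* Two ping-pong buffers, sized for the first (largest) padded pass. */
    size_t capacity = (n + group_size - 1) / group_size * group_size;
    double *src = calloc(capacity, sizeof *src);
    double *dst = calloc(capacity, sizeof *dst);
    assert(src != NULL && dst != NULL);
    memcpy(src, input, n * sizeof *src);

    size_t count = n;
    while (count > 1) {
        /* Pad with zeros so every group is full; zeros do not change a sum. */
        size_t padded = (count + group_size - 1) / group_size * group_size;
        for (size_t i = count; i < padded; ++i)
            src[i] = 0.0;

        reduce_pass(src, dst, padded, group_size);
        count = padded / group_size;   /* one partial sum per group */

        double *tmp = src; src = dst; dst = tmp;
        ++*passes;
    }

    double result = src[0];
    free(src);
    free(dst);
    return result;
}
```

With 1000 elements and a group size of 64, the first pass leaves 16 partial sums and the second pass reduces them in a single group, so two launches are enough. On a device, each call to reduce_pass would correspond to one enqueue of the kernel, with the output buffer of one launch used as the input of the next.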

Additionally, you want to be careful to set an appropriate work-group size (i.e. local size), which depends on local and global memory throughput and latency. Unfortunately, as far as I'm aware there is no way to determine this through OpenCL other than with self-profiling code, though that's not too difficult to write, since OpenCL gives you JIT compilation. Through empirical testing I've found that you need to find a sweet spot between suffering too many bank conflicts (too large a local size) and paying global memory latency penalties (too small a local size). It's best to run a benchmark first to determine the optimal local size for your reduction, and then use that local size for future reductions.
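The benchmark-then-reuse approach could be structured roughly as follows. This is only a sketch in plain C: time_reduction_fn is a hypothetical callback that a real program would implement by running the reduction once with the given local size and returning the elapsed time, and synthetic_time is a made-up cost curve (not a measurement) that stands in for it so the selection logic can be run without a device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs the reduction once with the given local size
 * and returns the elapsed time (any consistent unit). */
typedef double (*time_reduction_fn)(size_t local_size, void *user_data);

/* Times every candidate `repeats` times, keeps the fastest run of each
 * (to reduce noise), and returns the candidate with the lowest time. */
size_t pick_local_size(const size_t *candidates, size_t n_candidates,
                       int repeats, time_reduction_fn time_fn,
                       void *user_data)
{
    assert(n_candidates > 0 && repeats > 0);
    size_t best_size = candidates[0];
    double best_time = -1.0;   /* negative means "nothing measured yet" */

    for (size_t i = 0; i < n_candidates; ++i) {
        double fastest = time_fn(candidates[i], user_data);
        for (int r = 1; r < repeats; ++r) {
            double t = time_fn(candidates[i], user_data);
            if (t < fastest)
                fastest = t;
        }
        if (best_time < 0.0 || fastest < best_time) {
            best_time = fastest;
            best_size = candidates[i];
        }
    }
    return best_size;
}

/* Purely synthetic stand-in for a real measurement: the first term grows
 * for small local sizes, the second for large ones, giving a U-shaped
 * curve. The constants are arbitrary and carry no hardware meaning. */
double synthetic_time(size_t local_size, void *user_data)
{
    (void)user_data;
    return 1000.0 / (double)local_size + (double)local_size / 64.0;
}
```

The value returned by pick_local_size can be stored and reused as the local size for later reductions on the same device. With a real timing callback, the result depends entirely on the hardware and the data size, so it should be measured again when either changes.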

Edit: It's also worth noting that the best way to chain your kernel invocations together is through OpenCL events.
