
Strategy for doing final reduction

I am trying to implement an OpenCL version for doing a reduction of an array of floats.

To achieve it, I took the following code snippet found on the web:

__kernel void sumGPU ( __global const double *input, 
                       __global double *partialSums,
                       __local double *localSums)
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Copy from global memory to local memory
  localSums[local_id] = input[get_global_id(0)];

  // Loop for computing localSums
  for (uint stride = group_size/2; stride>0; stride /=2)
     {
      // Waiting for each 2x2 addition into given workgroup
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide WorkGroup into 2 parts and add elements 2 by 2
      // between local_id and local_id + stride
      if (local_id < stride)
        localSums[local_id] += localSums[local_id + stride];
     }

  // Write result into partialSums[nWorkGroups]
  if (local_id == 0)
    partialSums[get_group_id(0)] = localSums[0];
 }                  

This kernel code works well, but I would like to compute the final sum by adding all the partial sums of each work group. Currently, I do this final summation on the CPU with a simple loop over nWorkGroups iterations.

I also saw another solution using atomic functions, but it seems to be implemented for int, not for float. I think that only CUDA provides atomic functions for float.

I also saw that I could use another kernel which performs this summation, but I would like to avoid that solution in order to keep the source simple and readable. Maybe I cannot do without it...

I must tell you that I use OpenCL 1.2 (as returned by clinfo) on a Radeon HD 7970 Tahiti 3GB (I think OpenCL 2.0 is not supported by my card).

More generally, I would like advice about the simplest method to perform this final summation with my graphics card model and OpenCL 1.2.

Sorry for the previous code; it also had a problem.

CLK_GLOBAL_MEM_FENCE affects only the current work group. I was confused. =[

If you want to do the reduction sum on the GPU, you should enqueue a second reduction kernel with clEnqueueNDRangeKernel after clFinish(commandQueue).

Please just take the concept.

 
 
 
  
__kernel void sumGPU ( __global const double *input, 
                       __global double *partialSums,
                       __local double *localSums)
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Copy from global memory to local memory
  localSums[local_id] = input[get_global_id(0)];

  // Loop for computing localSums
  for (uint stride = group_size/2; stride>0; stride /=2)
     {
      // Waiting for each 2x2 addition into given workgroup
      barrier(CLK_LOCAL_MEM_FENCE);

      // Divide WorkGroup into 2 parts and add elements 2 by 2
      // between local_id and local_id + stride
      if (local_id < stride)
        localSums[local_id] += localSums[local_id + stride];
     }

  // Write result into partialSums[nWorkGroups]
  if (local_id == 0)
    partialSums[get_group_id(0)] = localSums[0];

  barrier(CLK_GLOBAL_MEM_FENCE);

  // Let work group 0 reduce all the partial sums
  if (get_group_id(0) == 0)
   {
    if (local_id < get_num_groups(0))  // 16384
     {
      for (int n = 0; n < get_num_groups(0); n += group_size)
        localSums[local_id] += partialSums[local_id + n];

      barrier(CLK_LOCAL_MEM_FENCE);

      for (int s = group_size/2; s > 0; s /= 2)
       {
        if (local_id < s)
          localSums[local_id] += localSums[local_id + s];
        barrier(CLK_LOCAL_MEM_FENCE);
       }

      if (local_id == 0)
        partialSums[0] = localSums[0];
     }
   }
 }
 
 

If that float's order of magnitude is smaller than exa-scale, then:

Instead of

if (local_id == 0)
  partialSums[get_group_id(0)] = localSums[0];

You could use

if (local_id == 0)
{
    if(strategy==ATOMIC)
    {
        long integer_part=getIntegerPart(localSums[0]);
        atom_add (&totalSumIntegerPart[0] ,integer_part);
        long float_part=1000000*getFloatPart(localSums[0]);
         // 1000000 for saving meaningful 7 digits as integer
        atom_add (&totalSumFloatPart[0] ,float_part);
    }
}

This will overflow the float part, so when you divide it by 1000000 in another kernel, it may hold a value greater than 1000000; you then take its integer part and add it to the real integer part:

   float value=0;
   if(strategy==ATOMIC)
   {
       float float_part=getFloatPart_(totalSumFloatPart[0]);
       float integer_part=getIntegerPart_(totalSumFloatPart[0])
       + totalSumIntegerPart[0];
       value=integer_part+float_part;
   }

Just a few atomic operations shouldn't have a noticeable effect on the whole kernel time.

Some of these get___part functions can already be written easily using floor and similar functions. Some need a division by 1M.
