OpenCL - How to perform a reduction

Question

Based on the following post, I am trying to implement a sum reduction of an array with this kernel code:

 #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

__kernel void sumGPU ( __global const long *input, 
               __global long *finalSum
               )
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Temporary local value
  local long tempInput;

  tempInput = input[local_id];

  // Variable for final sum 
  local long totalSumIntegerPart[1];

  // Initialize sums
  if (local_id==0)
    totalSumIntegerPart[0] = 0;

  // Compute atom_add into each workGroup 
  barrier(CLK_LOCAL_MEM_FENCE);

  atom_add(&totalSumIntegerPart[0], tempInput);

  barrier(CLK_LOCAL_MEM_FENCE);

  // Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

}

But the value of finalSum is not the expected one. I initially set the input array as follows:

 for (i=0; i<nWorkItems; i++)
    input[i] = i+1;

So, with nWorkItems = 1024, I expect finalSum = nWorkItems*(nWorkItems+1)/2 = 524800.
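As a quick sanity check of that expected value, here is a minimal host-side C sketch. The helper name expected_sum is hypothetical and not part of the original code; it simply adds up the values written by the initialization loop above.

```c
/* Hypothetical helper: sums the values stored by input[i] = i + 1
 * for i in [0, n_work_items), i.e. 1 + 2 + ... + n_work_items. */
long expected_sum(long n_work_items)
{
    long sum = 0;
    for (long i = 0; i < n_work_items; i++)
        sum += i + 1;
    return sum;
}
```

Calling expected_sum(1024) gives 524800, which agrees with the closed form 1024*1025/2.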

What I actually get is finalSum = 16384.

I get this result with sizeWorkGroup = 16 and nWorkItems = 1024.

Strangely, with sizeWorkGroup = 32 and nWorkItems = 1024, I get another value: finalSum = 32768.

I don't understand the last instruction (which is supposed to compute the sum of the partial sums, i.e. one per workgroup):

// Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

Indeed, I would have thought that the instruction atom_add(finalSum, totalSumIntegerPart[0]); would be independent of the local_id tested in the if condition.

The most important point is that this instruction has to be executed "number of workGroups" times (assuming finalSum is a value shared between all workGroups, isn't it?).

So I thought I could replace:

// Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

by

 // Perform sum of each workGroup sum
      if (local_id==0)
        atom_add(finalSum, totalSumIntegerPart[0]);

Could anyone help me get the right value with my parameters (sizeWorkGroup = 16 and nWorkItems = 1024), i.e. a finalSum equal to 524800?

Or explain to me why this final sum is not computed correctly?

UPDATE:

Here is the kernel code from the linked post (it is slightly different from mine because the atomic add here only adds 1 for each work-item):

kernel void AtomicSum(global int* sum)

{
 local int tmpSum[1]; 
 if(get_local_id(0)==0){
 tmpSum[0]=0;}

barrier(CLK_LOCAL_MEM_FENCE);         
atomic_add(&tmpSum[0],1);         
barrier(CLK_LOCAL_MEM_FENCE);    

if(get_local_id(0)==(get_local_size(0)-1)){
  atomic_add(sum,tmpSum[0]);
 }

}

Is this valid kernel code, I mean, does it give correct results?

Maybe a solution would be to put this at the beginning of my kernel code:

unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// load one tile into local memory
int idx = i * localSize + tid;
localInput[tid] = input[idx];

I am going to test it and keep you informed.

Thanks

Answer 1

This line is wrong:

tempInput = input[local_id];

It should be:

tempInput = input[get_global_id(0)];

Because local_id only ranges from 0 to get_local_size(0)-1, every work-group reads the same first elements of input, so you are always summing the first region of the array. That is consistent with your odd results, and it is why they depend on the work-group size:

16*16*64 = 16384
32*32*32 = 32768
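To make the indexing difference concrete, here is a small CPU-side C sketch that emulates a 1D NDRange and counts how many distinct input elements get read under each scheme. The function name count_elements_read and its parameters are hypothetical; the sketch assumes a zero global offset and that n_items is a multiple of group_size, so that the global id equals group * group_size + local_id.

```c
#include <stdlib.h>

/* Hypothetical helper: emulates which input indices the work-items touch.
 * use_global_id != 0 mimics input[get_global_id(0)],
 * use_global_id == 0 mimics input[get_local_id(0)].
 * Returns the number of distinct elements read (0 if allocation fails). */
size_t count_elements_read(size_t n_items, size_t group_size, int use_global_id)
{
    unsigned char *seen = calloc(n_items, 1);
    size_t distinct = 0;

    if (seen == NULL)
        return 0;

    for (size_t group = 0; group < n_items / group_size; group++) {
        for (size_t local_id = 0; local_id < group_size; local_id++) {
            /* 1D NDRange with zero offset (assumption of this sketch) */
            size_t global_id = group * group_size + local_id;
            size_t idx = use_global_id ? global_id : local_id;

            if (!seen[idx]) {
                seen[idx] = 1;
                distinct++;
            }
        }
    }

    free(seen);
    return distinct;
}
```

With 1024 work-items and groups of 16, indexing by the local id touches only 16 distinct elements, while indexing by the global id touches all 1024.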

Your code can also be simplified a bit:

  uint local_id = get_local_id(0);

  // Variable for final sum 
  local long totalSumIntegerPart;

  // Initialize sums
  if (local_id==0)
    totalSumIntegerPart = 0;

  // Compute atom_add into each workGroup 
  barrier(CLK_LOCAL_MEM_FENCE);    
  atom_add(&totalSumIntegerPart, input[get_global_id(0)]);    
  barrier(CLK_LOCAL_MEM_FENCE);

  // Perform sum of each workGroup sum
  if (local_id==0)
    atom_add(finalSum, totalSumIntegerPart);

I would also avoid leaning on atomics as heavily as you do, because they are not the most efficient way of doing reductions. You can probably get around 10x more speed with a proper reduction method. However, this is fine as a proof of concept or for learning about local memory and OpenCL.
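One common reading of "proper reduction methods" is a tree reduction in local memory, where each step halves the number of active work-items. The following is only a CPU-side C sketch of that idea, not the answerer's code and not a tuned kernel. The names reduce_group, tree_reduce and MAX_GROUP_SIZE are hypothetical, and the sketch assumes group_size is a power of two no larger than MAX_GROUP_SIZE and that n_items is a multiple of group_size.

```c
#include <stddef.h>

#define MAX_GROUP_SIZE 256  /* hypothetical upper bound for this sketch */

/* Reduce one work-group's tile with a halving-stride tree. */
long reduce_group(const long *input, size_t group, size_t group_size)
{
    long scratch[MAX_GROUP_SIZE];  /* stands in for a __local buffer */

    /* Each work-item loads its own element, indexed by global id. */
    for (size_t lid = 0; lid < group_size; lid++)
        scratch[lid] = input[group * group_size + lid];

    for (size_t stride = group_size / 2; stride > 0; stride /= 2) {
        /* On a device, barrier(CLK_LOCAL_MEM_FENCE) would separate steps. */
        for (size_t lid = 0; lid < stride; lid++)
            scratch[lid] += scratch[lid + stride];
    }
    return scratch[0];  /* partial sum of this work-group */
}

/* Combine the per-group partial sums: one addition per work-group,
 * where the kernel above would issue one atom_add on finalSum. */
long tree_reduce(const long *input, size_t n_items, size_t group_size)
{
    long total = 0;
    for (size_t group = 0; group < n_items / group_size; group++)
        total += reduce_group(input, group, group_size);
    return total;
}
```

With input[i] = i + 1 and 1024 items, tree_reduce returns 524800 for group sizes of both 16 and 32. In a real kernel, the inner loop over lid corresponds to the work-items running in parallel, so only log2(group_size) steps are needed per group and no local atomics are involved.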

Answer 1 (accepted), 2017-01-24 11:08:43
