OpenCL - method to perform a reduction

Question

Following the post below, I tried to sum an array with this kernel code:

 #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

__kernel void sumGPU ( __global const long *input, 
               __global long *finalSum
               )
 {
  uint local_id = get_local_id(0);
  uint group_size = get_local_size(0);

  // Temporary local value
  local long tempInput;

  tempInput = input[local_id];

  // Variable for final sum 
  local long totalSumIntegerPart[1];

  // Initialize sums
  if (local_id==0)
    totalSumIntegerPart[0] = 0;

  // Compute atom_add into each workGroup 
  barrier(CLK_LOCAL_MEM_FENCE);

  atom_add(&totalSumIntegerPart[0], tempInput);

  barrier(CLK_LOCAL_MEM_FENCE);

  // Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

}

But the value of finalSum is not the expected one. I initially set the input array like this:

 for (i=0; i<nWorkItems; i++)
    input[i] = i+1;

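So, for nWorkItems = 1024, I expect finalSum = nWorkItems*(nWorkItems+1)/2 = 524800.

In practice, I get finalSum = 16384.

I get this result with sizeWorkGroup = 16 and nWorkItems = 1024.

Strangely, with sizeWorkGroup = 32 and nWorkItems = 1024, I get a different value: finalSum = 32768.

I don't understand the last instruction (which is supposed to add up the partial sums, i.e. the sum of each work-group):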

// Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

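Indeed, I would have thought that the instruction atom_add(finalSum, totalSumIntegerPart[0]); does not depend on the local_id tested in the if condition.

Above all, this instruction has to be executed "number of workGroups" times (assuming finalSum is a value shared between all work-groups, isn't it?).

So I thought I could replace: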

// Perform sum of each workGroup sum
  if (local_id==(get_local_size(0)-1))
    atom_add(finalSum, totalSumIntegerPart[0]);

with

 // Perform sum of each workGroup sum
      if (local_id==0)
        atom_add(finalSum, totalSumIntegerPart[0]);

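Could anyone get the correct value, i.e. finalSum equal to 524800, with my parameters (sizeWorkGroup = 16 and nWorkItems = 1024)?

Or explain to me why this final sum does not behave as expected?

Update:

Here is the kernel code from the following link (it differs slightly from mine because here the atomic add only increments by 1 for each work-item):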

kernel void AtomicSum(global int* sum)

{
 local int tmpSum[1]; 
 if(get_local_id(0)==0){
 tmpSum[0]=0;}

barrier(CLK_LOCAL_MEM_FENCE);         
atomic_add(&tmpSum[0],1);         
barrier(CLK_LOCAL_MEM_FENCE);    

if(get_local_id(0)==(get_local_size(0)-1)){
  atomic_add(sum,tmpSum[0]);
 }

}

What I mean is: is this valid kernel code that gives correct results?

Maybe a solution would be to put this at the beginning of my kernel code:

unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// load one tile into local memory
int idx = i * localSize + tid;
localInput[tid] = input[idx];

I will test it and keep you informed.

Thanks

Answer 1

This line is wrong:

tempInput = input[local_id];

It should be:

tempInput = input[get_global_id(0)];

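Every work-group is summing the first region of the input, which is consistent with your odd results and with the fact that they depend on the work-group size.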

16*16*64 = 16384
32*32*32 = 32768
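
To see why the index matters, here is a minimal sequential sketch in plain Python (not OpenCL). The function name simulate_sum and its parameters are made up for illustration. It only models which element each work-item reads, and it does not try to reproduce the exact figures reported in the question.

```python
def simulate_sum(n_work_items, group_size, use_global_id):
    """Sequentially mimic the kernel: one local sum per work-group, then one global add."""
    data = [i + 1 for i in range(n_work_items)]  # input[i] = i + 1, as in the question
    final_sum = 0
    for group_id in range(n_work_items // group_size):
        local_sum = 0
        for local_id in range(group_size):
            global_id = group_id * group_size + local_id
            # The original kernel indexes with local_id, the fix uses global_id.
            index = global_id if use_global_id else local_id
            local_sum += data[index]
        final_sum += local_sum  # stands in for the single atom_add per work-group
    return final_sum


fixed_16 = simulate_sum(1024, 16, use_global_id=True)
fixed_32 = simulate_sum(1024, 32, use_global_id=True)
buggy_16 = simulate_sum(1024, 16, use_global_id=False)
buggy_32 = simulate_sum(1024, 32, use_global_id=False)

print(fixed_16, fixed_32)    # 524800 524800
print(buggy_16 != buggy_32)  # True: the wrong result changes with the group size
```

With local_id as the index, every group re-reads the same first group_size elements, so the total can only depend on the group size and the number of groups, never on the rest of the array.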

Your code can also be simplified a bit:

  uint local_id = get_local_id(0);

  // Variable for final sum 
  local long totalSumIntegerPart;

  // Initialize sums
  if (local_id==0)
    totalSumIntegerPart = 0;

  // Compute atom_add into each workGroup 
  barrier(CLK_LOCAL_MEM_FENCE);    
  atom_add(&totalSumIntegerPart, input[get_global_id(0)]);    
  barrier(CLK_LOCAL_MEM_FENCE);

  // Perform sum of each workGroup sum
  if (local_id==0)
    atom_add(finalSum, totalSumIntegerPart);
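
On the question about the if condition: which work-item does the final add is irrelevant (local_id 0 or the last one both work), but exactly one work-item per group must do it. A rough sequential sketch in Python, with the hypothetical helper final_sum, shows what would happen if the guard were dropped and every work-item added the group sum:

```python
def final_sum(n_work_items, group_size, guarded):
    """Model the last step of the kernel for every work-group in turn."""
    data = [i + 1 for i in range(n_work_items)]  # input[i] = i + 1
    total = 0
    for group_id in range(n_work_items // group_size):
        start = group_id * group_size
        group_sum = sum(data[start:start + group_size])  # value held in local memory
        # Guarded: one work-item adds the group sum.
        # Unguarded: all group_size work-items add the same group sum.
        adders = 1 if guarded else group_size
        total += adders * group_sum
    return total


print(final_sum(1024, 16, guarded=True))   # 524800
print(final_sum(1024, 16, guarded=False))  # 8396800, i.e. 16 times too large
```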

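Also, I would not lean on atomics the way you do, since atomics are not the most efficient way to do a reduction. With a proper reduction method you could probably get something like a 10x speedup. However, as a PoC or for learning local memory and CL, it is fine.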
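
One common candidate for a "proper" reduction is a pairwise (tree) reduction in local memory. That choice is my assumption, not something stated above. The sketch below models it sequentially in Python, with made-up names tree_reduce_group and reduce_all, and assumes the work-group size is a power of two.

```python
def tree_reduce_group(values):
    """Pairwise (tree) reduction of one work-group's tile; len(values) must be a power of two."""
    scratch = list(values)  # stands in for a local-memory buffer
    stride = len(scratch) // 2
    while stride > 0:
        # On a device, the work-items with local_id < stride would do this in
        # parallel, with a barrier(CLK_LOCAL_MEM_FENCE) between strides.
        for local_id in range(stride):
            scratch[local_id] += scratch[local_id + stride]
        stride //= 2
    return scratch[0]


def reduce_all(data, group_size):
    """Reduce each tile, then combine the per-group partial sums."""
    partial = [
        tree_reduce_group(data[start:start + group_size])
        for start in range(0, len(data), group_size)
    ]
    return sum(partial)  # on a device: one atomic add per group, or a second pass


print(reduce_all([i + 1 for i in range(1024)], 16))  # 524800
```

Each group then needs about log2(group_size) steps instead of group_size serialized atomic adds on the same local variable.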

Solution 1
1 Accepted 2017-01-24 11:08:43
