OpenCL - Method to perform a reduction
Based on the following post, I am trying to implement a sum reduction of an array with this kernel code:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
__kernel void sumGPU ( __global const long *input,
__global long *finalSum
)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Temporary local value
local long tempInput;
tempInput = input[local_id];
// Variable for final sum
local long totalSumIntegerPart[1];
// Initialize sums
if (local_id==0)
totalSumIntegerPart[0] = 0;
// Compute atom_add into each workGroup
barrier(CLK_LOCAL_MEM_FENCE);
atom_add(&totalSumIntegerPart[0], tempInput);
barrier(CLK_LOCAL_MEM_FENCE);
// Perform sum of each workGroup sum
if (local_id==(get_local_size(0)-1))
atom_add(finalSum, totalSumIntegerPart[0]);
}
But the value of finalSum is not the expected one. I initially set the input array to:
for (i=0; i<nWorkItems; i++)
input[i] = i+1;
So, with nWorkItems = 1024, I expect finalSum = nWorkItems*(nWorkItems+1)/2 = 524800.
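The expected value can be double-checked on the host before debugging the kernel. The following is a minimal sketch in plain C; the function names expected_sum and host_sum are hypothetical and do not come from the original post.

```c
#include <assert.h>
#include <stdint.h>

/* Closed form for 1 + 2 + ... + n. */
int64_t expected_sum(int64_t n) {
    assert(n >= 0);
    return n * (n + 1) / 2;
}

/* Same value obtained by adding up the elements as the question
   initializes them (input[i] = i + 1). */
int64_t host_sum(int64_t n) {
    int64_t total = 0;
    for (int64_t i = 0; i < n; i++) {
        total += i + 1;
    }
    return total;
}
```

For nWorkItems = 1024 both functions give 524800, which is the reference the kernel result should match.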
But I actually get finalSum = 16384.
I get this result with sizeWorkGroup = 16 and nWorkItems = 1024.
Strangely, with sizeWorkGroup = 32 and nWorkItems = 1024, I get another value: finalSum = 32768.
I don't understand the last instruction (which is supposed to add up the partial sums, i.e. one per workgroup):
// Perform sum of each workGroup sum
if (local_id==(get_local_size(0)-1))
atom_add(finalSum, totalSumIntegerPart[0]);
Indeed, I would have thought that the instruction atom_add(finalSum, totalSumIntegerPart[0]); would be independent of which local_id is tested in the if condition.
What matters most is that this instruction is executed "number of workGroups" times (assuming that finalSum is a value shared between all workgroups, isn't it?).
So I thought I could replace:
// Perform sum of each workGroup sum
if (local_id==(get_local_size(0)-1))
atom_add(finalSum, totalSumIntegerPart[0]);
by
// Perform sum of each workGroup sum
if (local_id==0)
atom_add(finalSum, totalSumIntegerPart[0]);
Could anyone help me get the right value with my parameters (sizeWorkGroup = 16 and nWorkItems = 1024), i.e. a finalSum equal to 524800?
Or explain to me why this final sum is not computed correctly?
UPDATE:
Here's the kernel code from the following link (it is slightly different from mine because atomic_add here only increments by 1 for each work-item):
kernel void AtomicSum(global int* sum)
{
local int tmpSum[1];
if(get_local_id(0)==0){
tmpSum[0]=0;}
barrier(CLK_LOCAL_MEM_FENCE);
atomic_add(&tmpSum[0],1);
barrier(CLK_LOCAL_MEM_FENCE);
if(get_local_id(0)==(get_local_size(0)-1)){
atomic_add(sum,tmpSum[0]);
}
}
Is this valid kernel code, I mean, does it give correct results?
Maybe a solution could be to put this at the beginning of my kernel code:
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// load one tile into local memory
int idx = i * localSize + tid;
localInput[tid] = input[idx];
I am going to test it and keep you informed.
Thanks
This line is wrong:
tempInput = input[local_id];
It should be:
tempInput = input[get_global_id(0)];
You are always summing the first region of your input, which is consistent with your weird results and explains why they depend on the work-group size:
16*16*64 = 16384
32*32*32 = 32768
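The effect of the indexing can be reproduced on the host without a device. The sketch below is a plain C emulation, not OpenCL code, and the function emulate_sum and its parameters are hypothetical names chosen for illustration. It assumes every work-group contributes one partial sum and that the group size divides the number of work-items.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side emulation of the kernel's indexing (assumption: one partial
   sum per work-group, group_size divides n_items).
   use_global_id == 0 mimics input[local_id];
   use_global_id != 0 mimics input[get_global_id(0)]. */
int64_t emulate_sum(const int64_t *input, size_t n_items,
                    size_t group_size, int use_global_id) {
    assert(group_size > 0 && n_items % group_size == 0);
    int64_t final_sum = 0;
    for (size_t group = 0; group < n_items / group_size; group++) {
        int64_t group_sum = 0;  /* plays the role of the local accumulator */
        for (size_t local_id = 0; local_id < group_size; local_id++) {
            size_t global_id = group * group_size + local_id;
            group_sum += input[use_global_id ? global_id : local_id];
        }
        final_sum += group_sum;  /* one add into the total per work-group */
    }
    return final_sum;
}
```

With global indexing every element is read exactly once, so the total does not depend on the group size. With local indexing only the first group_size elements are ever read, so the total changes whenever the group size changes.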
Your code can also be simplified a bit:
uint local_id = get_local_id(0);
// Variable for final sum
local long totalSumIntegerPart;
// Initialize sums
if (local_id==0)
totalSumIntegerPart = 0;
// Compute atom_add into each workGroup
barrier(CLK_LOCAL_MEM_FENCE);
atom_add(&totalSumIntegerPart, input[get_global_id(0)]);
barrier(CLK_LOCAL_MEM_FENCE);
// Perform sum of each workGroup sum
if (local_id==0)
atom_add(finalSum, totalSumIntegerPart);
Also, I would not overuse atomics the way you do, because they are not the most efficient way of doing reductions. You can probably get about a 10x speedup with a proper reduction method. However, it is fine as a PoC or for learning local memory and CL.
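The answer does not say which reduction method it has in mind. One common choice is a tree reduction in local memory, where the number of active work-items is halved at each step. The sketch below emulates that pattern serially in plain C; it is not OpenCL code, and tree_reduce_sum, MAX_GROUP_SIZE and the scratch array are hypothetical names. It assumes a power-of-two group size that divides the number of work-items.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUP_SIZE 1024  /* assumed upper bound for this sketch */

/* Serial emulation of a per-work-group tree reduction. */
int64_t tree_reduce_sum(const int64_t *input, size_t n_items, size_t group_size) {
    assert(group_size > 0 && group_size <= MAX_GROUP_SIZE);
    assert((group_size & (group_size - 1)) == 0);  /* power of two */
    assert(n_items % group_size == 0);

    int64_t scratch[MAX_GROUP_SIZE];  /* stands in for local memory */
    int64_t final_sum = 0;

    for (size_t group = 0; group < n_items / group_size; group++) {
        /* Each work-item loads one element, indexed by its global id. */
        for (size_t lid = 0; lid < group_size; lid++)
            scratch[lid] = input[group * group_size + lid];

        /* Each pass halves the active range; in a kernel, a local-memory
           barrier would separate consecutive passes. */
        for (size_t stride = group_size / 2; stride > 0; stride /= 2)
            for (size_t lid = 0; lid < stride; lid++)
                scratch[lid] += scratch[lid + stride];

        final_sum += scratch[0];  /* one add into the total per work-group */
    }
    return final_sum;
}
```

In a real kernel the inner passes run in parallel across the work-group, so a group of 16 needs 4 passes instead of 16 serialized atomic adds, and only one atomic per work-group touches the global total.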