Parallel reduction sum on GPU computes the wrong result in OpenCL

Question

I have written a parallel reduction sum on the GPU that works in global memory, because I believe my GPU has no shared memory (I take this to mean that I can't use local memory?). The problem is that when I try to add more than 1024*4 numbers, it outputs the wrong result, usually off by a few hundred to a few thousand depending on how many numbers I input. What could the reason be? A is the input, C is the output.

__kernel void GMM(__global float *A, __global float *B, __global float *C)
{
    uint global_id = get_global_id(0);
    uint group_size = get_global_size(0);

    B[global_id] = A[global_id];
    for (int stride = group_size / 2; stride > 0; stride /= 2)
    {
        if (global_id < stride)
        {
            B[global_id] += B[global_id + stride];
        }
    }
    if (global_id == 0)
        C[get_group_id(0)] = B[0];
}

Answer 1

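Solved it: apparently I do have shared memory. Using __local memory and local barriers, the results are now consistent and correct! The global-memory version has no barriers at all, so a work-item can read B[global_id+stride] before another work-item has finished writing it, and a barrier only synchronizes work-items within the same work-group anyway.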
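
The answer does not include the fixed kernel, so here is a minimal sketch of what a __local-memory version could look like. It is not the poster's actual code. The kernel name reduce_sum_local and the scratch and n parameters are hypothetical. It assumes the work-group size is a power of two and that the host allocates scratch by calling clSetKernelArg with a size of local_size * sizeof(float) and a NULL pointer.

```c
// Sketch only: per-work-group sum reduction in __local memory.
// Assumptions (not from the original post): power-of-two work-group size,
// `scratch` holds get_local_size(0) floats, `n` is the number of valid inputs.
__kernel void reduce_sum_local(__global const float *A,
                               __global float *C,
                               __local float *scratch,
                               const uint n)
{
    const uint gid = get_global_id(0);
    const uint lid = get_local_id(0);

    // Stage one element per work-item; pad out-of-range items with 0.
    scratch[lid] = (gid < n) ? A[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction within the work-group.
    for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2)
    {
        if (lid < stride)
        {
            scratch[lid] += scratch[lid + stride];
        }
        // The barrier sits outside the `if` so every work-item reaches it.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 writes this group's partial sum.
    if (lid == 0)
    {
        C[get_group_id(0)] = scratch[0];
    }
}
```

With this layout C ends up holding one partial sum per work-group, so the host (or a second kernel launch) still has to add those partial sums together to get the final total.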
