
Sum reduction with parallel algorithm - Bad performance compared to CPU version

I have written a small piece of code that does a sum reduction of a 1D array. I am comparing a sequential CPU version with an OpenCL version.

The code is available at this link1

The kernel code is available at this link2

and if you want to compile it: link3 for the Makefile

My issue is the poor performance of the GPU version:

for vector sizes below 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements), the runtime of the GPU version is higher (only slightly, but still higher) than the CPU one.

As you can see, I have taken 2^n values so that the number of work-items is compatible with the work-group size.

Concerning the number of workgroups, I have taken :

// Number of work-groups
  int nWorkGroups = size/local_item_size;
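
For reference, here is roughly how that count maps onto the NDRange launch (a sketch only; the kernel and variable names are illustrative, the real host code is behind link1, and size is assumed to be an exact multiple of local_item_size):

    size_t local  = (size_t)local_item_size;
    size_t global = (size_t)nWorkGroups * local;  /* equals size when it divides evenly */

    /* One work-item per element, grouped into nWorkGroups work-groups. */
    clEnqueueNDRangeKernel(command_queue, reductionKernel,
                           1,        /* work_dim */
                           NULL,     /* global work offset */
                           &global, &local,
                           0, NULL, NULL);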

But for a high number of work-items, I wonder whether this value of nWorkGroups is suitable (for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 work-groups; isn't that too many?).

I tried to vary local_item_size over [64, 128, 256, 512, 1024], but the performance remains poor for all of these values.

I only get a real benefit for size = 1.024 * 10^9 elements; here are the runtimes:

Size of the vector
1024000000

Problem size = 1024000000

GPU Parallel Reduction : Wall Clock = 20 second 977511 micro

Final Sum Sequential = 5.2428800006710899200e+17

Sequential Reduction : Wall Clock = 337 second 459777 micro

From your experience, why do I get such bad performance? I thought the advantage over the CPU version should be more significant.

Maybe someone can spot a major mistake in the source code, because at the moment I cannot figure out how to solve this issue.

Thanks

Well, I can tell you a few reasons:

  • You don't need to write the reduction buffer. You can clear it directly in GPU memory using clEnqueueFillBuffer() or a helper kernel (see the first sketch after this list).

    ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0, local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);

  • Don't use blocking calls, except for the last read. Otherwise you are wasting time there (also shown in that sketch).

  • You are doing the last reduction on the CPU. Iterative processing through the kernel can help.

  • If your kernel only reduces 128 elements per pass, your 10^9 elements only get down to about 8 * 10^6, and the CPU does the rest. Once you add the data copy, it is completely not worth it. However, if you run 3 passes at 512 elements per pass, you read back from the GPU just 10^9 / 512^3 ≈ 8 values, so the only bottlenecks are the initial copy to the GPU and the kernel launches (see the multi-pass sketch below).
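
To make the first two points concrete, here is a minimal sketch of the host-side pattern, assuming OpenCL 1.2+ for clEnqueueFillBuffer() and reusing the buffer/variable names from the question (hostData is a hypothetical name for the input array):

    const double zero = 0.0;

    /* Clear the partial-sum buffer directly on the device instead of
       uploading zeros from the host. */
    clEnqueueFillBuffer(command_queue, reductionBuffer,
                        &zero, sizeof(zero),                 /* pattern, pattern size */
                        0, local_item_size * sizeof(double), /* offset, bytes to fill */
                        0, NULL, NULL);

    /* Upload the input without blocking (CL_FALSE); an in-order queue keeps
       the ordering for you. */
    clEnqueueWriteBuffer(command_queue, inputBuffer, CL_FALSE, 0,
                         size * sizeof(double), hostData, 0, NULL, NULL);

    clEnqueueNDRangeKernel(command_queue, reductionKernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);

    /* Only the final read blocks (alternatively, use CL_FALSE here and call
       clFinish() before touching the result). */
    clEnqueueReadBuffer(command_queue, reductionBuffer, CL_TRUE, 0,
                        local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);

And for the last two points, a sketch of the multi-pass idea: keep relaunching the reduction kernel, feeding each pass's partial sums back in as the next pass's input, until only a handful of values are left to read back. The buffer names and the kernel argument layout are assumptions, not the OP's code:

    size_t remaining = size;
    cl_mem in  = inputBuffer;    /* current input */
    cl_mem out = partialBuffer;  /* partial sums, one per work-group */

    while (remaining > (size_t)local_item_size) {
        size_t local  = (size_t)local_item_size;
        size_t groups = (remaining + local - 1) / local;
        size_t global = groups * local;
        cl_ulong n    = (cl_ulong)remaining;

        clSetKernelArg(reductionKernel, 0, sizeof(cl_mem), &in);
        clSetKernelArg(reductionKernel, 1, sizeof(cl_mem), &out);
        clSetKernelArg(reductionKernel, 2, local * sizeof(double), NULL); /* local scratch */
        clSetKernelArg(reductionKernel, 3, sizeof(cl_ulong), &n);

        clEnqueueNDRangeKernel(command_queue, reductionKernel, 1, NULL,
                               &global, &local, 0, NULL, NULL);

        remaining = groups;                    /* one partial sum per work-group */
        cl_mem tmp = in; in = out; out = tmp;  /* ping-pong: the old input buffer is
                                                  free to reuse after the first pass */
    }

    /* Read back the last few partial sums (at most local_item_size of them)
       from 'in' and finish the sum on the CPU. */

The loop assumes a kernel with the usual local-memory tree-reduction signature; something along these lines (this is not the OP's kernel from link2, just the common pattern):

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    __kernel void reduce(__global const double *in,
                         __global double *out,
                         __local double *scratch,
                         const ulong n)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        /* Load one element per work-item, padding with 0.0 past the end. */
        scratch[lid] = (gid < n) ? in[gid] : 0.0;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction in local memory. */
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* One partial sum per work-group. */
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }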
