
Sum reduction with parallel algorithm - Bad performance compared to CPU version

I have written a small piece of code that does a sum reduction of a 1D array. I am comparing a sequential CPU version with an OpenCL version.

The code is available at this link1

The kernel code is available at this link2

and if you want to compile it: link3 for the Makefile

My issue is the poor performance of the GPU version:

for vector sizes below 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements), the runtime of the GPU version is higher (only slightly, but still higher) than the CPU one.

As you can see, I have taken 2^n values so that the number of work-items is compatible with the work-group size.

Concerning the number of workgroups, I have taken :

// Number of work-groups
  int nWorkGroups = size/local_item_size;
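
For reference, here is roughly how that count maps onto the NDRange launch (a sketch only; the kernel and variable names are illustrative, the real host code is behind link1, and size is assumed to be an exact multiple of local_item_size):

    size_t local  = (size_t)local_item_size;
    size_t global = (size_t)nWorkGroups * local;  /* equals size when it divides evenly */

    /* One work-item per element, grouped into nWorkGroups work-groups. */
    clEnqueueNDRangeKernel(command_queue, reductionKernel,
                           1,        /* work_dim */
                           NULL,     /* global work offset */
                           &global, &local,
                           0, NULL, NULL);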

But for a high number of work-items, I wonder whether this value of nWorkGroups is suitable (for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 work-groups; isn't that too many?).

I tried to vary local_item_size over [64, 128, 256, 512, 1024], but the performance remains poor for all of these values.

I only get a real benefit for size = 1.024 * 10^9 elements; here are the runtimes:

Size of the vector
1024000000

Problem size = 1024000000

GPU Parallel Reduction : Wall Clock = 20 second 977511 micro

Final Sum Sequential = 5.2428800006710899200e+17

Sequential Reduction : Wall Clock = 337 second 459777 micro

From your experience, why do I get such bad performance? I thought the advantage over the CPU version should be more significant.

Maybe someone can spot a major mistake in the source code, because at the moment I cannot figure out how to solve this issue.

Thanks

Well, I can tell you a few reasons:

  • You don't need to write the reduction buffer. You can clear it directly in GPU memory using clEnqueueFillBuffer() or a helper kernel (see the first sketch after this list).

    ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0, local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);

  • Don't use blocking calls, except for the last read. Otherwise you are wasting time there (also shown in that sketch).

  • You are doing the last reduction on the CPU. Iterative processing through the kernel can help.

  • If your kernel only reduces 128 elements per pass, your 10^9 elements only get down to about 8 * 10^6, and the CPU does the rest. Once you add the data copy, it is completely not worth it. However, if you run 3 passes at 512 elements per pass, you read back from the GPU just 10^9 / 512^3 ≈ 8 values, so the only bottlenecks are the initial copy to the GPU and the kernel launches (see the multi-pass sketch below).
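
To make the first two points concrete, here is a minimal sketch of the host-side pattern, assuming OpenCL 1.2+ for clEnqueueFillBuffer() and reusing the buffer/variable names from the question (hostData is a hypothetical name for the input array):

    const double zero = 0.0;

    /* Clear the partial-sum buffer directly on the device instead of
       uploading zeros from the host. */
    clEnqueueFillBuffer(command_queue, reductionBuffer,
                        &zero, sizeof(zero),                 /* pattern, pattern size */
                        0, local_item_size * sizeof(double), /* offset, bytes to fill */
                        0, NULL, NULL);

    /* Upload the input without blocking (CL_FALSE); an in-order queue keeps
       the ordering for you. */
    clEnqueueWriteBuffer(command_queue, inputBuffer, CL_FALSE, 0,
                         size * sizeof(double), hostData, 0, NULL, NULL);

    clEnqueueNDRangeKernel(command_queue, reductionKernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);

    /* Only the final read blocks (alternatively, use CL_FALSE here and call
       clFinish() before touching the result). */
    clEnqueueReadBuffer(command_queue, reductionBuffer, CL_TRUE, 0,
                        local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);

And for the last two points, a sketch of the multi-pass idea: keep relaunching the reduction kernel, feeding each pass's partial sums back in as the next pass's input, until only a handful of values are left to read back. The buffer names and the kernel argument layout are assumptions, not the OP's code:

    size_t remaining = size;
    cl_mem in  = inputBuffer;    /* current input */
    cl_mem out = partialBuffer;  /* partial sums, one per work-group */

    while (remaining > (size_t)local_item_size) {
        size_t local  = (size_t)local_item_size;
        size_t groups = (remaining + local - 1) / local;
        size_t global = groups * local;
        cl_ulong n    = (cl_ulong)remaining;

        clSetKernelArg(reductionKernel, 0, sizeof(cl_mem), &in);
        clSetKernelArg(reductionKernel, 1, sizeof(cl_mem), &out);
        clSetKernelArg(reductionKernel, 2, local * sizeof(double), NULL); /* local scratch */
        clSetKernelArg(reductionKernel, 3, sizeof(cl_ulong), &n);

        clEnqueueNDRangeKernel(command_queue, reductionKernel, 1, NULL,
                               &global, &local, 0, NULL, NULL);

        remaining = groups;                    /* one partial sum per work-group */
        cl_mem tmp = in; in = out; out = tmp;  /* ping-pong: the old input buffer is
                                                  free to reuse after the first pass */
    }

    /* Read back the last few partial sums (at most local_item_size of them)
       from 'in' and finish the sum on the CPU. */

The loop assumes a kernel with the usual local-memory tree-reduction signature; something along these lines (this is not the OP's kernel from link2, just the common pattern):

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    __kernel void reduce(__global const double *in,
                         __global double *out,
                         __local double *scratch,
                         const ulong n)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        /* Load one element per work-item, padding with 0.0 past the end. */
        scratch[lid] = (gid < n) ? in[gid] : 0.0;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree reduction in local memory. */
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* One partial sum per work-group. */
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }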
