
Sum reduction with parallel algorithm - bad performance compared to CPU version

I have written a small piece of code that does a sum reduction of a 1D array. I am comparing a sequential CPU version and an OpenCL version.

The code is available at this link1

The kernel code is available at this link2

and if you want to compile it: link3 for the Makefile

My issue is the poor performance of the GPU version:

For vector sizes lower than 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements), the runtime of the GPU version is higher (only slightly, but still higher) than that of the CPU version.

As you can see, I have taken 2^n values in order to have a number of work-items compatible with the work-group size.

Concerning the number of work-groups, I have taken:

// Number of work-groups
  int nWorkGroups = size/local_item_size;

But for a high number of work-items, I wonder if the value of nWorkGroups is suitable (for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 work-groups; isn't this too much?).
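For context, here is a minimal sketch of how these sizes typically enter the kernel launch in an OpenCL host program (this assumes the standard setup; the actual variable names in the linked code may differ):

    // The runtime derives the number of work-groups as global / local,
    // e.g. 1.024 * 10^8 elements with a local size of 1024 gives 10^5 groups.
    size_t local_item_size  = 1024;
    size_t global_item_size = size;   // must be a multiple of local_item_size
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                 &global_item_size, &local_item_size,
                                 0, NULL, NULL);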

I tried to vary local_item_size over the range [64, 128, 256, 512, 1024], but the performance remains poor for all these values.

I only see a benefit for size = 1.024 * 10^9 elements; here are the runtimes:

Size of the vector
1024000000

Problem size = 1024000000

GPU Parallel Reduction : Wall Clock = 20 second 977511 micro

Final Sum Sequential = 5.2428800006710899200e+17

Sequential Reduction : Wall Clock = 337 second 459777 micro

From your experience, why do I get such bad performance? I thought the advantage over the CPU version would be more significant.

Maybe someone can spot a major mistake in the source code, because at the moment I can't manage to solve this issue.

Thanks

Well, I can tell you some reasons:

  • You don't need to write the reduction buffer from the host. You can clear it directly in GPU memory using clEnqueueFillBuffer() or a helper kernel (see the first sketch after this list), instead of:

    ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0, local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);

  • Don't use blocking calls, except for the last read. Otherwise you are wasting time there.

  • You are doing the last reduction on the CPU. Iterative processing through the kernel can help (see the second sketch after this list).

  • If your kernel only reduces 128 elements per pass, your 10^9 elements only get down to about 8 * 10^6, and the CPU does the rest. If you add the data copy on top of that, it is completely not worth it. However, if you run 3 passes at 512 elements per pass, you read back only 10^9 / 512^3 ≈ 8 values from the GPU. Then the only bottleneck is the initial copy to the GPU and the kernel launches.
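A minimal sketch of the first point, assuming an OpenCL 1.2 runtime and the same queue and buffer names as the snippet quoted above: the reduction buffer is cleared on the device, so no zero-filled host array and no host-to-device copy are needed.

    // Clear the reduction buffer directly in GPU memory (OpenCL >= 1.2).
    double zero = 0.0;
    ret = clEnqueueFillBuffer(command_queue, reductionBuffer,
                              &zero, sizeof(double),               // fill pattern and its size
                              0, local_item_size * sizeof(double), // offset and bytes to fill
                              0, NULL, NULL);

For the blocking-call point, the change is simply to pass CL_FALSE instead of CL_TRUE to every clEnqueueWriteBuffer / clEnqueueReadBuffer except the final read of the result, so the host does not stall after each command.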
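And a sketch of the iterative multi-pass reduction from the last two points. This is not the code from the linked repository: the kernel signature (input buffer, output buffer, element count) and the ping-pong buffer names are assumptions made for illustration, and the usual host setup (queue, kernel, buffers) is assumed to exist already. Each pass produces one partial sum per work-group, so the element count shrinks by a factor of 512 per pass and only a handful of values are read back at the end.

    // Re-launch the reduction kernel until only a few partial sums remain;
    // each pass divides the element count by the work-group size (512).
    size_t local = 512;
    size_t n = size;                               // elements still to reduce
    cl_mem in = inputBuffer, out = partialBuffer;  // hypothetical ping-pong buffers
    while (n > local) {
        size_t nGroups = (n + local - 1) / local;  // work-groups for this pass
        size_t global  = nGroups * local;          // global size, rounded up
        cl_ulong nElems = (cl_ulong)n;             // valid element count for the kernel
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
        clSetKernelArg(kernel, 2, sizeof(cl_ulong), &nElems);
        ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                     &global, &local, 0, NULL, NULL);
        cl_mem tmp = in; in = out; out = tmp;      // this pass's output feeds the next pass
        n = nGroups;
    }
    // Only the final read-back blocks; at most 512 partial sums come back.
    double partial[512];
    ret = clEnqueueReadBuffer(command_queue, in, CL_TRUE, 0,
                              n * sizeof(double), partial, 0, NULL, NULL);

With this scheme the 10^9-element case needs only a few kernel launches and a read of roughly 8 doubles, so the wall-clock time is dominated by the one-time copy of the input to the GPU.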
