Sum reduction with parallel algorithm - bad performance compared to CPU version
I have written a small program that does a sum reduction of a 1D array. I am comparing a sequential CPU version and an OpenCL version.
The code is available at this link1
The kernel code is available at this link2
and if you want to compile: link3 for the Makefile
My issue is the bad performance of the GPU version: for vector sizes below 1.024 * 10^9 elements (i.e. with 1024, 10240, 102400, 1024000, 10240000, 102400000 elements), the runtime of the GPU version is higher (only slightly, but higher) than the CPU one.
As you can see, I have taken sizes that are multiples of 1024 in order to have a number of work-items compatible with the size of a work-group.
Concerning the number of work-groups, I have taken:
// Number of work-groups
int nWorkGroups = size/local_item_size;
But for a high number of work-items, I wonder whether the value of nWorkGroups is suitable (for example, nWorkGroups = 1.024 * 10^8 / 1024 = 10^5 work-groups; isn't that too many?).
I tried to vary local_item_size over the range [64, 128, 256, 512, 1024], but the performance remains bad for all of these values.
I only see a benefit for size = 1.024 * 10^9 elements; here are the runtimes:
Size of the vector
1024000000
Problem size = 1024000000
GPU Parallel Reduction : Wall Clock = 20 second 977511 micro
Final Sum Sequential = 5.2428800006710899200e+17
Sequential Reduction : Wall Clock = 337 second 459777 micro
From your experience, why do I get such bad performance? I thought the advantage over the CPU version would be more significant.
Maybe someone can spot a major mistake in the source code, because at the moment I cannot solve this issue.
Thanks
Well, I can tell you some reasons:
You don't need to write the reduction buffer from the host. You can clear it directly in GPU memory using clEnqueueFillBuffer() or a helper kernel.
ret = clEnqueueWriteBuffer(command_queue, reductionBuffer, CL_TRUE, 0, local_item_size * sizeof(double), sumReduction, 0, NULL, NULL);
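A sketch of what that replacement could look like, reusing the question's `command_queue`, `reductionBuffer`, `local_item_size` and `ret` variables (note clEnqueueFillBuffer requires OpenCL 1.2 or later):

```c
/* Fill the reduction buffer with zeros on the device instead of
   copying a zeroed host array over PCIe. Assumes OpenCL 1.2+. */
double zero = 0.0;
ret = clEnqueueFillBuffer(command_queue, reductionBuffer,
                          &zero, sizeof(double),               /* pattern, pattern size */
                          0, local_item_size * sizeof(double), /* offset, bytes to fill */
                          0, NULL, NULL);                      /* no wait list, no event */
```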
Don't use blocking calls, except for the last read. Otherwise you are wasting time there.
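Concretely, the intermediate transfers can be enqueued with CL_FALSE and synchronized only once at the end. A sketch, again reusing the question's variable names:

```c
/* Non-blocking read: enqueue it and keep submitting work; the call
   returns immediately instead of stalling the host thread. */
ret = clEnqueueReadBuffer(command_queue, reductionBuffer, CL_FALSE, 0,
                          local_item_size * sizeof(double), sumReduction,
                          0, NULL, NULL);
/* ... enqueue further kernels/transfers here ... */
clFinish(command_queue);  /* single synchronization point at the very end */
```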
You are doing the last reduction on the CPU. Iterative processing through the kernel can help.
Because if your kernel only reduces 128 elements per pass, your 10^9 elements just get down to 8 * 10^6, and the CPU does the rest. If you add the data copy on top, it makes the GPU completely not worth it. However, if you run 3 passes at 512 elements per pass, you read back from the GPU just 10^9 / 512^3 = 8 values. So, the only bottlenecks would be the first copy to the GPU and the kernel launches.