
OpenCL: parallel reduction without local memory

Original post: 2015-09-04 08:11:46. Tags: opencl, reduction, prefix-sum

Most parallel reduction algorithms (from Nvidia, AMD, Intel, and so on) use shared (local) memory.

But what if a device doesn't have shared (local) memory? How can I do the reduction then?

If I use the same algorithms but store the temporary values in global memory, will it work correctly?

2 Answers

If the device supports OpenCL 2.0, then work_group_reduce can be used:

gentype work_group_reduce_<op>(gentype x)

The <op> in work_group_reduce_<op>, work_group_scan_exclusive_<op> and work_group_scan_inclusive_<op> defines the operator and can be add, min or max.
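
As a rough sketch of how this could look in a kernel, assuming an OpenCL 2.0 device, a program built with -cl-std=CL2.0, and a 1-D NDRange. The kernel name reduce_sum_wg and the buffers in and partial are made up for illustration:

```c
// Hypothetical kernel: one partial sum per work-group, no local memory declared.
// Assumes OpenCL C 2.0 (build option -cl-std=CL2.0) and a 1-D NDRange.
__kernel void reduce_sum_wg(__global const float *in,
                            __global float *partial,
                            const uint n)
{
    const size_t gid = get_global_id(0);

    // Work-items past the end of the input contribute the identity for add.
    const float x = (gid < n) ? in[gid] : 0.0f;

    // Collective call: every work-item in the work-group must reach it.
    const float sum = work_group_reduce_add(x);

    // All work-items receive the same result; one of them stores it.
    if (get_local_id(0) == 0)
        partial[get_group_id(0)] = sum;
}
```

The partial buffer then holds one value per work-group, so a second pass (another launch over partial, or a short loop on the host) is still needed to get the final result.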

Come to think of it, my comment already was the complete answer.

Yes, you can use global memory as a replacement for local memory, but:

- you have to allocate enough global memory for all work-groups and assign each work-group its own chunk of that memory (with local memory, you only specify as much memory as a single work-group needs, and each work-group gets its own allocation of that size)
- you have to use CLK_GLOBAL_MEM_FENCE instead of CLK_LOCAL_MEM_FENCE
- you will lose a significant amount of performance

If I have time this evening, I will post a simple example.
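
For illustration only (this is not the answerer's code), the points above might be sketched as the usual tree reduction with the local array swapped for a global scratch buffer. The names reduce_sum_global, scratch and partial are hypothetical. The sketch assumes a 1-D NDRange with no global offset, a power-of-two work-group size, a global size that is a multiple of the work-group size, and a scratch buffer with at least get_global_size(0) floats:

```c
// Hypothetical kernel: tree reduction using global memory as scratch space.
// Assumptions: 1-D NDRange, no global offset, power-of-two work-group size,
// scratch holds at least get_global_size(0) floats.
__kernel void reduce_sum_global(__global const float *in,
                                __global float *scratch,
                                __global float *partial,
                                const uint n)
{
    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);
    const size_t lsz = get_local_size(0);

    // Each work-group owns its own chunk of the scratch buffer.
    __global float *chunk = scratch + get_group_id(0) * lsz;

    // Load one element per work-item; pad with 0 past the end of the input.
    chunk[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_GLOBAL_MEM_FENCE);

    // Halve the number of active work-items each step.
    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            chunk[lid] += chunk[lid + stride];
        // Global fence: the scratch data lives in global memory.
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    // Work-item 0 writes this work-group's partial sum.
    if (lid == 0)
        partial[get_group_id(0)] = chunk[0];
}
```

On the host side, scratch would be created with room for one float per work-item in the whole NDRange, whereas a local-memory version only needs room for one work-group. The barrier only synchronizes work-items within a single work-group, which is why each group is restricted to its own chunk.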