OpenCL: parallel reduction without local memory
Most parallel reduction algorithms (from Nvidia, AMD, Intel and so on) use shared (local) memory.
But what if the device doesn't have shared (local) memory?
How can I do the reduction then?
If I use the same algorithms but store the temporary values in global memory, will it work correctly?
If the device supports OpenCL 2.0, then work_group_reduce can be used:
gentype work_group_reduce_<op>(gentype x)
The <op> in work_group_reduce_<op>, work_group_scan_exclusive_<op> and work_group_scan_inclusive_<op> defines the operator and can be add, min or max.
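A minimal sketch of how the built-in might be used, assuming an OpenCL 2.0 device and a program built with -cl-std=CL2.0. The kernel name group_sum and the parameters in, group_sums and n are hypothetical, chosen only for illustration:

```c
// OpenCL C 2.0 kernel (illustrative sketch; names are hypothetical).
// Each work-group writes the sum of its elements to group_sums[group id].
__kernel void group_sum(__global const int *in,
                        __global int *group_sums,
                        const uint n)
{
    const size_t gid = get_global_id(0);

    // Work-items past the end of the input contribute 0, the identity for add.
    // Every work-item in the group must still reach the built-in call below.
    const int x = (gid < n) ? in[gid] : 0;

    // All work-items in the group receive the same reduced value.
    const int sum = work_group_reduce_add(x);

    // One work-item per group stores the result.
    if (get_local_id(0) == 0) {
        group_sums[get_group_id(0)] = sum;
    }
}
```

Each work-group produces one partial sum, so the partial sums still have to be combined, for example by a second kernel launch or on the host.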
Thinking about it, my comment was already the complete answer.
Yes, you can use global memory as a replacement for local memory, but:
If I have time this evening, I will post a simple example.
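The global-memory variant discussed above can be sketched as follows. This is not the answerer's example, only an illustration under stated assumptions: the work-group size is a power of two, no global offset is used, and scratch is a hypothetical global buffer with one element per work-item. The kernel name and all parameter names are made up for this sketch:

```c
// OpenCL C kernel (illustrative sketch; names are hypothetical).
// Tree reduction per work-group, using a global buffer where a
// local-memory version would use a __local array.
// Assumptions: local size is a power of two, no global offset,
// scratch holds at least get_global_size(0) floats.
__kernel void reduce_global_scratch(__global const float *in,
                                    __global float *scratch,
                                    __global float *group_sums,
                                    const uint n)
{
    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);
    const size_t lsz = get_local_size(0);

    // Each work-item owns scratch[gid], so each group works on its own slice.
    // Out-of-range work-items store 0, the identity for addition.
    scratch[gid] = (gid < n) ? in[gid] : 0.0f;

    // Make the global writes visible to the other work-items of this group.
    barrier(CLK_GLOBAL_MEM_FENCE);

    // Halve the number of active work-items at every step.
    for (size_t stride = lsz / 2; stride > 0; stride >>= 1) {
        if (lid < stride) {
            // gid + stride stays inside this group's slice because lid + stride < lsz.
            scratch[gid] += scratch[gid + stride];
        }
        // The barrier is outside the if so that every work-item reaches it.
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    // Work-item 0 of each group now holds the group's sum.
    if (lid == 0) {
        group_sums[get_group_id(0)] = scratch[gid];
    }
}
```

The only changes from a local-memory version are the scratch buffer's address space, its indexing by global ID, and the fence flag passed to barrier. On devices that do have dedicated local memory, this form is typically slower, since every step reads and writes global memory.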