
CUDA sum across blocks

Hello, I am new to CUDA programming and I have run into a problem.

I have a variable, call it foo, stored in the shared memory of each block, with a different value in each block. I want a single thread to sum these values across all blocks. I thought of writing foo to global memory and then computing the sum, but is there a function that can do this more quickly?

Thanks for your help.

It would be faster to have one thread in each block perform an atomicAdd() operation, adding the per-block value to a single, grid-wide variable in global memory.
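A minimal sketch of that approach, under some assumptions: foo is an int, each block's value is faked as blockIdx.x + 1 just to have something to sum, and the names sumFooAcrossBlocks and sumAcrossBlocks are made up for illustration. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block holds a per-block value foo in shared memory. Thread 0 of
// every block adds it to one grid-wide accumulator in global memory.
__global__ void sumFooAcrossBlocks(int *total)
{
    __shared__ int foo;

    if (threadIdx.x == 0) {
        foo = blockIdx.x + 1;   // assumption: stand-in for the real per-block value
    }
    __syncthreads();            // make foo visible to every thread in the block

    // ... the rest of the block could read foo here ...

    if (threadIdx.x == 0) {
        atomicAdd(total, foo);  // exactly one atomic operation per block
    }
}

// Hypothetical host helper: launches the kernel and returns the grid-wide sum.
int sumAcrossBlocks(int numBlocks, int threadsPerBlock)
{
    int *dTotal = nullptr;
    int hTotal = 0;

    cudaMalloc(&dTotal, sizeof(int));
    cudaMemset(dTotal, 0, sizeof(int));   // the accumulator must start at zero

    sumFooAcrossBlocks<<<numBlocks, threadsPerBlock>>>(dTotal);

    // cudaMemcpy waits for the kernel to finish before copying the result back.
    cudaMemcpy(&hTotal, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dTotal);
    return hTotal;
}

int main()
{
    // With 8 blocks the per-block values are 1..8, so the sum should be 36.
    int total = sumAcrossBlocks(8, 128);
    printf("sum = %d\n", total);
    return 0;
}
```

Note that the accumulated total is only guaranteed to be complete once the kernel has finished, since blocks run in no particular order. If a thread needs the full sum on the device, the simplest option is to read it in a subsequent kernel launch.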

See the relevant section of the CUDA C Programming Guide.

For a deeper exploration of optimizing reductions (summation being one example), albeit not necessarily the one you want to perform, have a look at Mark Harris's presentation: Optimizing Parallel Reduction in CUDA.
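For context, here is a rough sketch of the kind of reduction that presentation deals with: summing a whole array rather than one value per block. It uses a shared-memory tree reduction with sequential addressing inside each block, then combines the per-block partial sums with atomicAdd(). It is not the fully optimized version from the presentation, and the names reduceSum, sumOnDevice and kThreads are illustrative assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // block size; must be a power of two for the loop below

__global__ void reduceSum(const int *in, int n, int *total)
{
    __shared__ int partial[kThreads];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Load one element per thread, padding with zero past the end of the array.
    partial[tid] = (idx < n) ? in[idx] : 0;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 now holds the block's partial sum; fold it into the grid-wide total.
    if (tid == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host helper: copies the data to the device and returns its sum.
int sumOnDevice(const std::vector<int> &host)
{
    if (host.empty()) {
        return 0;  // avoid launching a grid with zero blocks
    }

    int n = static_cast<int>(host.size());
    int *dIn = nullptr;
    int *dTotal = nullptr;
    int hTotal = 0;

    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(int));

    int blocks = (n + kThreads - 1) / kThreads;  // round up to cover every element
    reduceSum<<<blocks, kThreads>>>(dIn, n, dTotal);

    cudaMemcpy(&hTotal, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotal);
    return hTotal;
}

int main()
{
    std::vector<int> data(1000, 1);  // 1000 ones
    printf("sum = %d\n", sumOnDevice(data));
    return 0;
}
```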


 