
CUDA sum across blocks

Original post: 2018-11-01 17:21:59 · tags: cuda, gpu

Hello, I am new to CUDA programming and I have run into a problem.

I have a variable, call it foo, stored in the shared memory of each block; its value differs from one block to another. I want a single thread to sum these values across all blocks. I thought of writing foo to global memory and then computing the sum, but is there a function that can do this more quickly?

Thanks for your help.

1 answer

It would be faster to have one thread in each block perform an atomicAdd() operation, adding the per-block value to a single, grid-wide variable in global memory.

See the relevant section of the CUDA C Programming Guide.
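The atomicAdd() approach described above could look roughly like the following. This is only a minimal sketch: the kernel name sum_foo_kernel, the host helper sum_foo_across_blocks, and the stand-in value blockIdx.x + 1 for foo are assumptions made for illustration, since the question does not show how foo is computed.

```cuda
#include <cuda_runtime.h>

// Each block holds a per-block value `foo` in shared memory. One thread per
// block then adds it to a single grid-wide accumulator in global memory.
__global__ void sum_foo_kernel(int *grid_sum)
{
    __shared__ int foo;

    if (threadIdx.x == 0) {
        foo = blockIdx.x + 1;   // hypothetical stand-in for the real per-block value
    }
    __syncthreads();            // make foo visible to every thread in the block

    // ... the block's other work that uses foo would go here ...

    if (threadIdx.x == 0) {
        atomicAdd(grid_sum, foo);   // exactly one atomic per block
    }
}

// Hypothetical host helper: launches the kernel and returns the total,
// or -1 if a CUDA call fails.
int sum_foo_across_blocks(int num_blocks, int threads_per_block)
{
    int *d_sum = nullptr;
    int h_sum = -1;

    if (cudaMalloc(&d_sum, sizeof(int)) != cudaSuccess) {
        return -1;
    }
    // The accumulator must start at zero before any block adds to it.
    if (cudaMemset(d_sum, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(d_sum);
        return -1;
    }

    sum_foo_kernel<<<num_blocks, threads_per_block>>>(d_sum);

    // A blocking cudaMemcpy on the default stream waits for the kernel to
    // finish, so every block's contribution is included in the value read back.
    cudaError_t err = cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return (err == cudaSuccess) ? h_sum : -1;
}
```

Note that the total is only guaranteed to be complete once the kernel has finished. If a thread inside the same kernel needs the final sum, some form of grid-wide synchronization is needed in addition to the atomics.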

For a deeper exploration of optimizing reductions (of which summation is one), albeit not necessarily the one you want to perform, have a look at Mark Harris' presentation: Optimizing Parallel Reduction in CUDA.
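If the per-block value itself has to be computed by summing many elements, a basic shared-memory tree reduction is the usual starting point for that kind of optimization. The sketch below is not taken from the presentation. It is a simple, unoptimized variant that assumes a power-of-two block size, and the names reduce_sum_kernel, reduce_sum and BLOCK_SIZE are made up for this example. Each block reduces its slice of the input in shared memory, then one thread per block combines the block results with atomicAdd().

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed to be a power of two

// Sums `n` ints from `in` into `*grid_sum` (which must be zero beforehand).
__global__ void reduce_sum_kernel(const int *in, unsigned int n, int *grid_sum)
{
    __shared__ int partial[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element; threads past the end contribute 0.
    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory: halve the active range each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();   // reached by all threads, outside the branch
    }

    // partial[0] now holds this block's sum; combine across blocks.
    if (tid == 0) {
        atomicAdd(grid_sum, partial[0]);
    }
}

// Hypothetical host helper: returns true on success and stores the sum in *out.
bool reduce_sum(const int *h_in, unsigned int n, int *out)
{
    if (n == 0) { *out = 0; return true; }   // nothing to launch

    int *d_in = nullptr;
    int *d_sum = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_sum, sizeof(int)) == cudaSuccess
           && cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemset(d_sum, 0, sizeof(int)) == cudaSuccess;

    if (ok) {
        unsigned int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        reduce_sum_kernel<<<blocks, BLOCK_SIZE>>>(d_in, n, d_sum);
        // The blocking copy waits for the kernel and reports launch errors.
        ok = cudaMemcpy(out, d_sum, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_in);    // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_sum);
    return ok;
}
```

The kernel must be launched with exactly BLOCK_SIZE threads per block, as the helper does, because the shared array and the halving loop both depend on that size.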