What is faster in CUDA: a global memory write + __threadfence(), or atomicExch() to global memory?

Question

Assuming we have many threads that access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes, but the writes are coalesced. On the other hand, atomicExch() takes into account just the relevant memory addresses, but I don't know whether the writes are coalesced.

In code:

array[threadIdx.x] = value;

Or

atomicExch(&array[threadIdx.x], value);

Thanks.

Answer 1

On Kepler GPUs, I would bet on atomicExch, since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.

Please run an experiment and report the results.
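One way to set up such an experiment is sketched below. It is only a starting point under stated assumptions: the kernel names, the helper `compareVariants`, and the grid sizes are hypothetical, a global index is used instead of the bare `threadIdx.x` from the question so that more than one block can run, and error checking is omitted for brevity. No timing result is claimed here; the numbers depend on the GPU and, for small grids, mostly reflect launch overhead.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant A: plain coalesced store followed by a device-wide fence.
__global__ void storeWithFence(int* array, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    array[i] = value;
    __threadfence();
}

// Variant B: atomic exchange to the same, conflict-free address.
__global__ void storeWithAtomicExch(int* array, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicExch(&array[i], value);
}

// Times `reps` launches of one variant with CUDA events; returns milliseconds.
float timeKernel(void (*kernel)(int*, int), int* d_array,
                 int blocks, int threads, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(d_array, 0);  // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        kernel<<<blocks, threads>>>(d_array, r + 1);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Runs both variants, prints their timings, and returns true if each one
// left the expected value (the last one written, `reps`) in every cell.
bool compareVariants(int blocks, int threads, int reps)
{
    const int n = blocks * threads;
    int* d_array = nullptr;
    cudaMalloc(&d_array, n * sizeof(int));
    std::vector<int> h_array(n);

    void (*variants[2])(int*, int) = { storeWithFence, storeWithAtomicExch };
    const char* names[2] = { "store + __threadfence", "atomicExch" };

    bool ok = true;
    for (int v = 0; v < 2; ++v) {
        cudaMemset(d_array, 0, n * sizeof(int));
        float ms = timeKernel(variants[v], d_array, blocks, threads, reps);
        cudaMemcpy(h_array.data(), d_array, n * sizeof(int),
                   cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) ok = ok && (h_array[i] == reps);
        printf("%-24s %.3f ms for %d launches\n", names[v], ms, reps);
    }

    cudaFree(d_array);
    return ok;
}
```

Calling `compareVariants(256, 256, 10)` from `main` prints one timing line per variant. Increasing the grid size and repetition count should make the difference between the two, if any, easier to see above the launch overhead.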

Answer 2

Those two do very different things.

atomicExch ensures that no two threads modify a given cell at the same time. If such a conflict occurs, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic... function.
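To illustrate a case where the atomic does matter, here is a minimal sketch in which many threads target the same cell. The kernel `claimSlot` and the host helper `countWinners` are hypothetical names chosen for this example, and error checking is omitted. Because atomicExch returns the value that was in the cell before the exchange, exactly one thread observes the initial 0.

```cuda
#include <cuda_runtime.h>

// Every thread tries to claim the same slot. atomicExch returns the old
// value, so only the first thread to arrive sees 0 and counts as a winner.
__global__ void claimSlot(int* slot, int* winners)
{
    if (atomicExch(slot, 1) == 0) {
        atomicAdd(winners, 1);
    }
}

// Launches the kernel and returns how many threads claimed the slot.
int countWinners(int blocks, int threads)
{
    int* slot = nullptr;
    int* winners = nullptr;
    cudaMalloc(&slot, sizeof(int));
    cudaMalloc(&winners, sizeof(int));
    cudaMemset(slot, 0, sizeof(int));
    cudaMemset(winners, 0, sizeof(int));

    claimSlot<<<blocks, threads>>>(slot, winners);

    int result = -1;
    cudaMemcpy(&result, winners, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(slot);
    cudaFree(winners);
    return result;
}
```

With a plain `*slot = 1` there would be no way for a thread to learn whether it was first. In the question's code each thread writes its own cell, so this kind of arbitration is never needed.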

__threadfence() delays the current thread (and only the current thread!) to ensure that the writes it made before the fence are visible to other threads before any writes it makes after the fence. As such, __threadfence() on its own, without any follow-up code, is not very interesting.
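A sketch of what such follow-up code might look like is given below, loosely following the "last block finishes the job" pattern from the memory fence discussion in the CUDA programming guide. The names `sumBlockIds`, `blocksDone`, and `runSumBlockIds` are hypothetical, the kernel assumes one thread per block, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Counts how many blocks have published their partial result.
__device__ unsigned int blocksDone = 0;

// Assumes a launch with one thread per block.
__global__ void sumBlockIds(volatile int* partial, int* total)
{
    // 1. Publish this block's contribution to global memory.
    partial[blockIdx.x] = blockIdx.x + 1;

    // 2. Order the write above before the counter update below,
    //    as seen by every other thread on the device.
    __threadfence();

    // 3. Take a ticket; atomicAdd returns the previous counter value.
    unsigned int ticket = atomicAdd(&blocksDone, 1u);

    // 4. The block holding the last ticket knows all other blocks have
    //    already passed their fence, so their partial results are visible.
    if (ticket == gridDim.x - 1) {
        int sum = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) {
            sum += partial[b];
        }
        *total = sum;
        blocksDone = 0;  // reset so the kernel can be launched again
    }
}

// Launches the kernel with `numBlocks` single-thread blocks and
// returns the total computed on the device.
int runSumBlockIds(int numBlocks)
{
    int* partial = nullptr;
    int* total = nullptr;
    cudaMalloc(&partial, numBlocks * sizeof(int));
    cudaMalloc(&total, sizeof(int));

    sumBlockIds<<<numBlocks, 1>>>(partial, total);

    int result = -1;
    cudaMemcpy(&result, total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(partial);
    cudaFree(total);
    return result;
}
```

Here the fence is what makes the counter meaningful: without it, the last block could see the counter reach its final value while some partial results were not yet visible. The fence orders one thread's writes; it does not make the blocks run in any particular order.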

For that reason, I don't think there is any point in comparing the efficiency of those two. Maybe if you could show a more concrete use case, I could relate...

Note that neither of those gives you any guarantee about the actual order in which the threads execute.
