
How to have atomic load in CUDA

My question is how I can perform an atomic load in CUDA. Atomic exchange can emulate an atomic store. Can an atomic load be emulated inexpensively in a similar manner? I could use an atomic add with 0 to load the contents atomically, but I think that is expensive because it performs an atomic read-modify-write instead of just a read.

To the best of my knowledge, there is currently no way of requesting an atomic load in CUDA, and that would be a great feature to have.

There are two quasi-alternatives, each with its advantages and drawbacks:

  1. Use a no-op atomic read-modify-write, as you suggest. I have provided a similar answer in the past. You get guaranteed atomicity and memory consistency, but you pay the cost of a needless write.
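As a minimal sketch of option 1 (the helper name is hypothetical), a no-op read-modify-write such as `atomicOr` with 0 returns the old value without changing memory, at the cost of a full RMW transaction:

```cuda
// Sketch of option 1: emulate an atomic load with a no-op read-modify-write.
// atomicOr(addr, 0) leaves the memory contents unchanged but returns the
// old value atomically; the hardware still performs a full RMW.
__device__ unsigned int atomicLoadRMW(unsigned int *addr)
{
  return atomicOr(addr, 0u); // returns the current value; memory unchanged
}
```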

  2. In practice, the second closest thing to an atomic load is marking a variable volatile, although strictly speaking the semantics are completely different. The language does not guarantee atomicity of the load (for example, you might in theory get a torn read), but you are guaranteed to get the most up-to-date value. In practice, though, as indicated in the comments by @Robert Crovella, it is impossible to get a torn read for properly-aligned transactions of at most 32 bytes, which does make them atomic.

Solution 2 is kind of hacky and I do not recommend it, but it is currently the only write-less alternative to solution 1. The ideal solution would be to add a way to express atomic loads directly in the language.

In addition to using volatile as recommended in the other answer, using __threadfence appropriately is also required to get an atomic load with safe memory ordering.

While some of the comments are saying to just use a normal read because it cannot tear, that is not the same as an atomic load. There's more to atomics than just tearing:

A normal read may reuse a previous load that's already in a register, and thus may not reflect changes made by other SMs with the desired memory ordering. For instance, int *flag = ...; while (*flag) { ... } may only read flag once and reuse this value for every iteration of the loop. If you're waiting for another thread to change the flag's value, you'll never observe the change. The volatile modifier ensures that the value is actually read from memory on every access. See the CUDA documentation on volatile for more info.
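A short sketch of that spin-wait, with volatile forcing each iteration to issue a fresh load from memory (helper name is hypothetical):

```cuda
// Without volatile, the compiler is free to hoist the load of *flag out of
// the loop and spin forever on a stale register value. The volatile
// qualifier forces a memory access on every iteration.
__device__ void spinUntilSet(volatile int *flag)
{
  while (*flag == 0)
  {
    // each iteration re-reads *flag from memory
  }
}
```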

Additionally, you'll need to use a memory fence to enforce the correct memory ordering in the calling thread. Without a fence, you get "relaxed" semantics in C++11 parlance, which can be unsafe when using an atomic for communication.

For example, say your code (non-atomically) writes some large data to memory and then uses a normal write to set an atomic flag to indicate that the data has been written. The instructions may be reordered, hardware cachelines may not be flushed prior to setting the flag, etc. The result is that these operations are not guaranteed to be executed in any particular order, and other threads may not observe these events in the order you expect: the write to the flag is permitted to happen before the guarded data is written.

Meanwhile, if the reading thread is also using normal reads to check the flag before conditionally loading the data, there will be a race at the hardware level. Out-of-order and/or speculative execution may load the data before the flag's read is completed. The speculatively loaded data is then used, which may not be valid since it was loaded prior to the flag's read.
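To make the hazard concrete, here is a sketch of the broken pattern just described (names are hypothetical, and the code is shown only to illustrate the race, not to be used):

```cuda
// BROKEN: plain loads/stores with no fences. Shown only to illustrate
// the race described above.
__device__ int data;
__device__ int ready; // intended as a flag, but accessed non-atomically

__device__ void producerBroken()
{
  data = 42;  // may become visible to other SMs *after* the flag below
  ready = 1;  // nothing orders this store after the data store
}

__device__ void consumerBroken(int *out)
{
  while (ready == 0) { } // may spin on a stale register value, too
  *out = data;           // may observe stale data loaded speculatively
}
```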

Well-placed memory fences prevent these sorts of issues by enforcing that instruction reordering will not affect your desired memory ordering and that previous writes are made visible to other threads. __threadfence() and friends are also covered in the CUDA docs.

Putting all of this together, writing your own atomic load method in CUDA looks something like:

// addr must be aligned properly.
__device__ unsigned int atomicLoad(const unsigned int *addr)
{
  const volatile unsigned int *vaddr = addr; // volatile to bypass cache
  __threadfence(); // for seq_cst loads. Remove for acquire semantics.
  const unsigned int value = *vaddr;
  // fence to ensure that dependent reads are correctly ordered
  __threadfence(); 
  return value; 
}

// addr must be aligned properly.
__device__ void atomicStore(unsigned int *addr, unsigned int value)
{
  volatile unsigned int *vaddr = addr; // volatile to bypass cache
  // fence to ensure that previous non-atomic stores are visible to other threads
  __threadfence(); 
  *vaddr = value;
}
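A hedged usage sketch of the two helpers above, fixing the flag-passing example from earlier (the kernel names and the value 42 are made up for illustration):

```cuda
__device__ unsigned int payload;    // the guarded data
__device__ unsigned int ready_flag; // the flag, accessed via the helpers

__global__ void producerKernel()
{
  payload = 42u;                // ordinary write of the guarded data
  atomicStore(&ready_flag, 1u); // fence inside orders the payload write first
}

__global__ void consumerKernel(unsigned int *out)
{
  while (atomicLoad(&ready_flag) == 0u) { } // fenced, volatile read
  *out = payload; // fence in atomicLoad orders this after the flag read
}
```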

This can be written similarly for other non-tearing load/store sizes.
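For example, a 64-bit variant under the same alignment assumption might look like:

```cuda
// 64-bit counterpart of atomicLoad; addr must be 8-byte aligned so the
// hardware transaction cannot tear.
__device__ unsigned long long atomicLoad64(const unsigned long long *addr)
{
  const volatile unsigned long long *vaddr = addr; // bypass the cache
  __threadfence(); // for seq_cst loads; remove for acquire semantics
  const unsigned long long value = *vaddr;
  __threadfence(); // order dependent reads after this load
  return value;
}
```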

From talking with some NVIDIA devs who work on CUDA atomics, it sounds like we should start seeing better support for atomics in CUDA, and the PTX already contains load/store instructions with acquire/release memory-ordering semantics -- but there is currently no way to access them without resorting to inline PTX. They're hoping to add them sometime this year. Once those are in place, a full std::atomic implementation shouldn't be far behind.
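For the adventurous, a sketch of what such an inline-PTX acquire load might look like; this assumes sm_70+ hardware and should be verified against the PTX ISA documentation before use:

```cuda
// Sketch only: an acquire load via inline PTX. ld.acquire requires
// sm_70+ (PTX ISA 6.0+); check the PTX ISA docs for exact forms.
__device__ unsigned int loadAcquire(const unsigned int *addr)
{
  unsigned int value;
  asm volatile("ld.acquire.gpu.u32 %0, [%1];"
               : "=r"(value)
               : "l"(addr)
               : "memory"); // clobber prevents compiler reordering
  return value;
}
```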
