
Improve atomic read from InterlockedCompareExchange()

Assume the architecture is ARM64 or x86-64.

I want to make sure these two are equivalent:

  1. a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
  2. MyBarrier(); a = *(volatile __int64*)p; MyBarrier();

Where MyBarrier() is a compiler-level memory barrier (hint), like __asm__ __volatile__ ("" ::: "memory"). So method 2 is supposed to be faster than method 1.

I heard that the _Interlocked() family of functions also implies a memory barrier at both the compiler and the hardware level.

I heard that reads of (properly aligned) intrinsic data types are atomic on these architectures, but I am not sure whether method 2 is safe to use in general.

(P.S. I think the CPU handles data dependencies automatically, so hardware barriers are not much of a concern here.)

Thank you for any advice / corrections on this.


Here are some benchmarks on Ivy Bridge (an i5 laptop).

(1E+006 loops: 27 ms):

; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx

(1E+006 loops: 27 ms):

; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 7 ms):

; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 1.26 ms, not synchronized?):

; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]

For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is true on your platform.

However, _MemoryBarrier() is not a "hint to the compiler". _MemoryBarrier() on x86 prevents both compiler and CPU reordering, and also ensures global visibility after a write. You also probably need only the first _MemoryBarrier(); the second one could be replaced with a _ReadWriteBarrier() unless a is also a shared variable. But you don't even need that, since you are reading through a volatile pointer, which prevents any compiler reordering in MSVC.

When you create this replacement, you basically end up with pretty much the same result:

// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val

// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val

Running these two in a loop on my i7 Ivy Bridge laptop gives equal results, within 2-3%.

However, with two memory barriers, the "optimized version" is actually around 2x slower.

So the better question is: why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic; an optimizing compiler will compile it to the most efficient sequence for your architecture, and add all the necessary barriers to prevent reordering and ensure cache coherency.
