
Improve atomic read from InterlockedCompareExchange()

Assume the architecture is ARM64 or x86-64.

I want to make sure these two are equivalent:

  1. a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
  2. MyBarrier(); a = *(volatile __int64*)p; MyBarrier();

Where MyBarrier() is a compiler-level memory barrier (hint), like __asm__ __volatile__ ("" ::: "memory"). So method 2 is supposed to be faster than method 1.

I heard that the _Interlocked() family of functions also implies a memory barrier at both the compiler and the hardware level.

I heard that reads of (properly aligned) intrinsic data types are atomic on these architectures, but I am not sure whether method 2 is safe to use in general.

(P.S. I think the CPU handles data dependencies automatically, so hardware barriers are not much of a concern here.)

Thank you for any advice / corrections on this.


Here are some benchmarks on Ivy Bridge (an i5 laptop).

(1E+006 loops: 27 ms):

; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx

(1E+006 loops: 27 ms):

; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 7 ms):

; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 1.26 ms, not synchronized?):

; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]

For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is true on your platform.

However, _MemoryBarrier() is not a "hint to the compiler". _MemoryBarrier() on x86 prevents both compiler and CPU reordering, and also ensures global visibility after a write. You also probably need only the first _MemoryBarrier(); the second one could be replaced with a _ReadWriteBarrier() unless a is also a shared variable. But you don't even need that, since you are reading through a volatile pointer, which prevents any compiler reordering in MSVC.

When you create this replacement, you basically end up with pretty much the same result:

// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val

// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val

Running these two in a loop on my i7 Ivy Bridge laptop gives equal results, within 2-3%.

However, with two memory barriers, the "optimized version" is actually around 2x slower.

So the better question is: why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic; an optimizing compiler will compile it to the most efficient sequence for your architecture, and add all the necessary barriers to prevent reordering and ensure cache coherency.
