
improve atomic read from InterlockedCompareExchange()

Assume the architecture is ARM64 or x86-64.

I want to confirm whether these two are equivalent:

  1. a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
  2. MyBarrier(); a = *(volatile __int64*)p; MyBarrier();

Where MyBarrier() is a compiler-level memory barrier (a hint to the compiler only), like __asm__ __volatile__ ("" ::: "memory"). So method 2 should be faster than method 1.
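To make the comparison concrete, here is a compilable sketch of the two methods. It assumes GCC/Clang on x86-64 or ARM64 and uses the __sync_val_compare_and_swap builtin as a stand-in for MSVC's _InterlockedCompareExchange64; the function names are mine.

```cpp
#include <cstdint>

// Compiler-level barrier only: stops the compiler from reordering
// memory accesses across this point; emits no CPU instruction.
static inline void MyBarrier() {
    __asm__ __volatile__("" ::: "memory");
}

// Method 1: CAS with (0, 0). The compare-exchange returns the old
// value either way, so this reads *p atomically (but with a locked
// read-modify-write cycle on the bus).
// __sync_val_compare_and_swap is a GCC/Clang stand-in here for
// MSVC's _InterlockedCompareExchange64.
static int64_t read_cas(int64_t* p) {
    return __sync_val_compare_and_swap(p, 0, 0);
}

// Method 2: plain volatile load, fenced only by compiler barriers.
// On x86-64/ARM64 an aligned 64-bit load is a single instruction.
static int64_t read_volatile(int64_t* p) {
    MyBarrier();
    int64_t a = *(volatile int64_t*)p;
    MyBarrier();
    return a;
}
```

Both functions return the same value for an aligned 64-bit object; the difference is in ordering guarantees and cost, as discussed below.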

I have heard that the _Interlocked*() functions also imply a memory barrier at both the compiler and hardware level.

I have also heard that reads of properly aligned native-width data are atomic on these architectures, but I am not sure whether method 2 can safely be used in general.

(P.S. I think the CPU handles data dependencies automatically, so a hardware barrier is not a major concern here.)

Thank you for any advice / corrections on this.


Here are some benchmarks on Ivy Bridge (an i5 laptop).

(1E+006 loops: 27 ms):

; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx

(1E+006 loops: 27 ms):

; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 7 ms):

; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]

(1E+006 loops: 1.26 ms, not synchronized?):

; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]

For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is the case on your platform.

However, _MemoryBarrier() is not a "hint to the compiler": on x86 it prevents both compiler and CPU reordering, and also ensures global visibility after a write. You also probably need only the first _MemoryBarrier(); the second could be replaced with a _ReadWriteBarrier(), unless a is also a shared variable. But you don't even need that here, since you are reading through a volatile pointer, which already prevents any compiler reordering in MSVC.

When you make this replacement, you end up with pretty much the same result:

// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val

// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val

Running these two in a loop on my i7 Ivy Bridge laptop gives equal results, within 2-3%.

However, with two memory barriers, the "optimized version" is actually around 2x slower.

So the better question is: why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic, and an optimizing compiler will compile it to the most efficient code for your architecture, adding all the barriers necessary to prevent reordering and ensure visibility.
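A minimal sketch of the std::atomic approach. On x86-64 an optimizing compiler lowers the load to a plain aligned mov (plus fences only where the chosen memory order requires them on a given architecture); the function names here are illustrative, not from the original post.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<int64_t> val{0};

// Default load: sequentially consistent, the strongest ordering.
// On x86-64 a seq_cst *load* is still just a plain mov.
int64_t read_seq_cst() {
    return val.load();
}

// Acquire ordering is often all a reader needs, and it is free
// on x86-64 (the hardware memory model already provides it).
int64_t read_acquire() {
    return val.load(std::memory_order_acquire);
}
```

Unlike the _InterlockedCompareExchange64 trick, the reader never performs a locked read-modify-write, so it does not contend for exclusive ownership of the cache line.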

