
Can atomic operations on a non-atomic<> pointer be safe and faster than atomic<>?

I have a dozen threads reading a pointer, and one thread that may change that pointer maybe once an hour or so.

The readers are super, super, super time-sensitive. I hear that atomic<char**> or whatever is the speed of going to main memory, which I want to avoid.

In modern (say, 2012 and later) server and high-end desktop Intel, can an 8-byte-aligned regular pointer be guaranteed not to tear if read and written normally? A test of mine runs an hour without seeing a tear.

Otherwise, would it be any better (or worse) if I do the write atomically and the reads normally? For instance by making a union of the two?

Note there are other questions about mixing atomic and non-atomic operations that don't specify CPUs, where the discussion devolves into language-lawyering. This question isn't about the spec, but rather about what will actually happen, including whether we know what will happen where the spec leaves things undefined.

x86 will never tear an asm load or store to an aligned pointer-width value. That part of this question, and your other question ( C++11 on modern Intel: am I crazy or are non-atomic aligned 64-bit load/store actually atomic? ) are both duplicates of Why is integer assignment on a naturally aligned variable atomic on x86?

This is part of why atomic<T> is so cheap for compilers to implement, and why there's no downside to using it.

The only real cost of reading an atomic<T> on x86 is that it can't optimize into a register across multiple reads of the same var. But you need to make that happen anyway for your program to work (ie to have threads notice updates to the pointer). On non-x86, only mo_relaxed is as cheap as a plain asm load, but x86's strong memory model makes even seq_cst loads cheap.

If you use the pointer multiple times in one function, do T* local_copy = global_ptr; so the compiler can keep local_copy in a register. Think of this as loading from memory into a private register, because that's exactly how it will compile. Operations on atomic objects don't optimize away, so if you want to re-read the global pointer once per loop, write your source that way. Or once outside the loop: write your source that way and let the compiler manage the local var.
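A minimal sketch of that local-copy idea (the Table type and g_table name are invented here for illustration; they're not from the question):

```cpp
#include <atomic>

// Hypothetical example: Table and g_table are made-up names, not from the question.
struct Table {
    int data[256];
    int lookup(int key) const { return data[key & 255]; }
};

std::atomic<Table*> g_table;            // written rarely, read constantly

int sum_two(int a, int b) {
    Table* local_copy = g_table.load(); // one load from memory into a register
    // The compiler can keep local_copy in a register and reuse it;
    // it will not re-read g_table for the second call.
    return local_copy->lookup(a) + local_copy->lookup(b);
}
```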


Apparently you keep trying to avoid atomic<T*> because of a huge misconception about the performance of std::atomic::load() pure-load operations. std::atomic::store() is somewhat slower with the default seq_cst ordering (it needs a full barrier); with release or relaxed it's just a plain store. But on x86, std::atomic loads have no extra cost even at seq_cst.

There is no performance advantage to avoiding atomic<T*> here. It will do exactly what you need safely and portably, and with high performance for your read-mostly use case. Each core reading it can access a copy in its private L1d cache. A write invalidates all copies of the line so the writer has exclusive ownership (MESI), but the next read from each core will get a shared copy that can stay hot in its private caches again.

(This is one of the benefits of coherent caches: readers don't have to keep checking some single shared copy. Writers are forced to make sure there are no stale copies anywhere before they can write. This is all done by hardware, not with software asm instructions. All ISAs that we run multiple C++ threads across have cache-coherent shared memory, which is why volatile sort of works for rolling your own atomics ( but don't do it ), like people used to have to do before C++11. Or like you're trying to do without even using volatile , which only works in debug builds. Definitely don't do that !)

Atomic loads compile to the same instructions compilers use for everything else, eg mov . At an asm level, every aligned load and store is an atomic operation (for power of 2 sizes up to 8 bytes). atomic<T> only has to stop the compiler from assuming that no other threads are writing the object between accesses.

(Unlike pure load / pure store, atomicity of a whole RMW doesn't happen for free ; ptr_to_int++ would compile to lock add qword [ptr], 4 . But in the uncontended case that's still vastly faster than a cache miss all the way to DRAM, just needing a "cache lock" inside the core that has exclusive ownership of the line. Like 20 cycles per operation if you're doing nothing but that back-to-back on Haswell ( https://agner.org/optimize/ ), but just one atomic RMW in the middle of other code can overlap nicely with surrounding ALU operations.)
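To make the pure-load vs. RMW distinction concrete (ptr_to_int is the variable named above; the asm in the comments is what mainstream x86-64 compilers typically emit, not a language guarantee):

```cpp
#include <atomic>

std::atomic<int*> ptr_to_int;     // the variable from the RMW example above

void bump() {
    ptr_to_int++;                 // atomic RMW: typically  lock add qword ptr [ptr_to_int], 4
}                                 // (4 == sizeof(int); needs a "cache lock", not a trip to DRAM)

int* peek() {
    return ptr_to_int.load();     // pure load: just a plain  mov  on x86, no lock prefix, no fence
}
```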

Pure read-only access is where lockless code using atomics really shines compared to anything that needs a RWlock - atomic<> readers don't contend with each other so the read-side scales perfectly for a use-case like this ( or RCU or a SeqLock ).

On x86 a seq_cst load (the default ordering) doesn't need any barrier instructions, thanks to x86's hardware memory-ordering model (program order loads/stores, plus a store buffer with store forwarding). That means you get full performance in the read side that uses your pointer without having to weaken to acquire or consume memory order.

If store performance were a factor, you could use std::memory_order_release so stores can also be just a plain mov, without needing to drain the store buffer with mfence or xchg.
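For example (g_ptr stands in for the shared pointer from the question; the asm notes describe typical x86-64 code generation):

```cpp
#include <atomic>

std::atomic<char**> g_ptr;        // stand-in for the rarely-written shared pointer

void publish_release(char** new_val) {
    g_ptr.store(new_val, std::memory_order_release);  // on x86: a plain  mov  store
}

void publish_seq_cst(char** new_val) {
    g_ptr = new_val;              // default seq_cst: typically  xchg  (or mov + mfence) on x86
}
```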


I hear that atomic<char**> or whatever is the speed of going to main memory

Whatever you read has misled you.

Even getting data between cores doesn't require going to actual DRAM, just to shared last-level cache. Since you're on Intel CPUs, L3 cache is a backstop for cache coherency.

Right after a core writes a cache line, it will still be in its private L1d cache in MESI Modified state (and Invalid in every other cache; this is how MESI maintains cache coherency = no stale copies of lines anywhere). A load on another core from that cache line will therefore miss in the private L1d and L2 caches, but L3 tags will tell the hardware which core has a copy of the line. A message goes over the ring bus to that core, getting it to write-back the line to L3. From there it can be forwarded to the core still waiting for the load data. This is pretty much what inter-core latency measures - the time between a store on one core and getting the value on another core.

The time this takes (inter-core latency) is roughly similar to a load that misses in L3 cache and has to wait for DRAM, like maybe 40ns vs. 70ns depending on the CPU. Perhaps this is what you read. (Many-core Xeons have more hops on the ring bus and more latency between cores, and from cores to DRAM.)

But that's only for the first load after a write. The data is cached by the L2 and L1d caches on the core that loaded it, and in Shared state in L3. After that, any thread that reads the pointer frequently will tend to make the line stay hot in the fast private L2 or even L1d cache on the core running that thread. L1d cache has 4-5 cycle latency, and can handle 2 loads per clock cycle.

And the line will be in Shared state in L3 where any other core can hit, so only the first core pays the full inter-core latency penalty.

(Before Skylake-AVX512, Intel chips use an inclusive L3 cache so the L3 tags can work as a snoop filter for directory-based cache coherence between cores. If a line is in Shared state in some private cache, it's also valid in Shared state in L3. Even on SKX where L3 cache doesn't maintain the inclusive property, the data will be there in L3 for a while after sharing it between cores.)

In debug builds, every variable is stored/reloaded to memory between C++ statements. The fact that this isn't (usually) 400 times slower than normal optimized builds shows that memory access isn't too slow in the un-contended case when it hits in cache. (Keeping data in registers is faster than memory, so debug builds are pretty bad in general. If you made every variable atomic<T> with memory_order_relaxed, that would be somewhat similar to compiling without optimization, except for stuff like ++.)

Just to be clear, I'm not saying that atomic<T> makes your code run at debug-mode speed. A shared variable that might have changed asynchronously needs to be reloaded from memory (through the cache) every time the source mentions it, and atomic<T> does that.


As I said, reading an atomic<char**> ptr will compile to just a mov load on x86, no extra fences, exactly the same as reading a non-atomic object.

Except that it blocks some compile-time reordering, and like volatile stops the compiler from assuming the value never changes and hoisting loads out of loops. It also stops the compiler from inventing extra reads. See https://lwn.net/Articles/793253/
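A sketch of the difference that makes (invented names; the broken version shows what the compiler is allowed to do to a plain non-atomic, non-volatile global):

```cpp
#include <atomic>

char** plain_ptr;                     // non-atomic global
std::atomic<char**> atomic_ptr;       // atomic global

void wait_for_change_broken(char** old_val) {
    while (plain_ptr == old_val) {}   // load may be hoisted out of the loop:
}                                     // can become an infinite loop in optimized builds

void wait_for_change_ok(char** old_val) {
    while (atomic_ptr.load(std::memory_order_acquire) == old_val) {}  // re-read every iteration
}
```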


I have a dozen threads reading a pointer, and one thread that may change that pointer maybe once an hour or so.

You might want RCU even if that means copying a relatively large data structure for each of those very infrequent writes. RCU makes readers truly read-only so read-side scaling is perfect.
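A rough sketch of that copy-and-publish pattern (names invented; real RCU also needs a safe way to reclaim the old copy, which is the hard part and is only hinted at here):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Table { std::vector<int> data; };          // invented example payload

std::atomic<const Table*> g_current{new Table{}}; // the single shared pointer

int read_entry(std::size_t i) {                   // read side: one plain mov load on x86
    const Table* t = g_current.load(std::memory_order_acquire);
    return i < t->data.size() ? t->data[i] : -1;
}

void update(int value) {                          // write side, maybe once an hour
    const Table* old_table = g_current.load(std::memory_order_relaxed);
    Table* copy = new Table(*old_table);          // copy the whole structure
    copy->data.push_back(value);                  // modify the private copy
    g_current.store(copy, std::memory_order_release);  // publish atomically
    // delete old_table;  // NOT yet safe: a reader may still be using it.
    // Real RCU waits for a grace period (or uses hazard pointers) before freeing.
}
```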

Other answers to your C++11/14/17: a readers/writer lock... without having a lock for the readers? suggested things involving multiple RWlocks to make sure a reader could always take one. That still involves an atomic RMW on some shared cache line that all readers contend to modify. Even readers that only take an RWlock will probably stall for the inter-core latency of getting the cache line containing the lock into MESI Modified state.

(Hardware Lock Elision used to solve the problem of avoiding contention between readers, but it's been disabled by microcode updates on all existing hardware.)
