Where is the lock for a std::atomic?

If a data structure has multiple elements in it, the atomic version of it cannot (always) be lock-free. I was told that this is true for larger types because the CPU cannot atomically change the data without using some sort of lock.

For example:

#include <iostream>
#include <atomic>

struct foo {
    double a;
    double b;
};

std::atomic<foo> var;

int main()
{
    std::cout << var.is_lock_free() << std::endl;
    std::cout << sizeof(foo) << std::endl;
    std::cout << sizeof(var) << std::endl;
}

The output (Linux/gcc) is:

0
16
16

Since the atomic and foo are the same size, I don't think a lock is stored in the atomic.

My question is: if an atomic variable uses a lock, where is it stored, and what does that mean for multiple instances of that variable?

The usual implementation is a hash table of mutexes (or even just simple spinlocks without a fallback to OS-assisted sleep/wakeup), using the address of the atomic object as a key. The hash function might be as simple as using the low bits of the address as an index into a power-of-2 sized array, but @Frank's answer shows that LLVM's std::atomic implementation XORs in some higher bits, so you don't automatically get aliasing when objects are separated by a large power of 2 (which is more common than any other random arrangement).

I think (but I'm not sure) that g++ and clang++ are ABI-compatible, i.e. that they use the same hash function and table, so they agree on which lock serializes access to which object. The locking is all done in libatomic, though, so if you dynamically link libatomic then all code inside the same program that calls __atomic_store_16 will use the same implementation; clang++ and g++ definitely agree on which function names to call, and that's enough. (But note that only lock-free atomic objects work in memory shared between different processes: each process has its own hash table of locks. Lock-free objects are supposed to (and in fact do) just work in shared memory on normal CPU architectures, even if the region is mapped to different addresses.)

Hash collisions mean that two atomic objects might share the same lock. This is not a correctness problem, but it could be a performance problem: instead of two pairs of threads separately contending with each other for two different objects, you could have all 4 threads contending for access to either object. Presumably that's unusual, and usually you aim for your atomic objects to be lock-free on the platforms you care about. But most of the time you don't get really unlucky, and it's basically fine.

Deadlocks aren't possible because there aren't any std::atomic functions that try to take the lock on two objects at once. So the library code that takes the lock never tries to take another lock while holding one of these locks. Extra contention / serialization is not a correctness problem, just a performance one.
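The address-keyed lock table described above can be sketched in a few lines of C++. This is only an illustration, not libatomic's actual code: the names kLockCount, lock_for, locked_store and locked_load are hypothetical, the table size of 64 is arbitrary, and std::mutex stands in for whatever lock a real runtime uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <type_traits>

// Same shape as the struct in the question: too large for one plain load/store.
struct foo {
    double a;
    double b;
};

// Hypothetical table size; any power of 2 works with the mask below.
constexpr std::size_t kLockCount = 64;

// Pick a lock from the object's address. The low 4 bits are dropped so that
// every byte of a 16-byte aligned object maps to the same lock.
inline std::mutex& lock_for(const void* p) {
    static std::mutex locks[kLockCount];  // one table per process
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return locks[(addr >> 4) & (kLockCount - 1)];
}

// Each operation takes exactly one lock and releases it before returning,
// which is why two of these operations can never deadlock each other.
template <class T>
void locked_store(T* dest, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy needs this");
    std::lock_guard<std::mutex> guard(lock_for(dest));
    std::memcpy(dest, &value, sizeof(T));
}

template <class T>
T locked_load(const T* src) {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy needs this");
    T out;
    std::lock_guard<std::mutex> guard(lock_for(src));
    std::memcpy(&out, src, sizeof(T));
    return out;
}
```

Usage would look like `foo x{}; locked_store(&x, foo{1.0, 2.0}); foo y = locked_load(&x);`. Note that the object itself stores no lock, so sizeof stays unchanged, and two unrelated objects whose addresses hash to the same slot simply wait on each other.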


x86-64 16-byte objects with GCC vs. MSVC:

As a hack, compilers can use lock cmpxchg16b to implement 16-byte atomic load/store, as well as actual read-modify-write operations.

This is better than locking, but has bad performance compared to 8-byte atomic objects (e.g. pure loads contend with other loads). It's the only documented safe way to atomically do anything with 16 bytes (see footnote 1).

AFAIK, MSVC never uses lock cmpxchg16b for 16-byte objects, and they're basically the same as a 24 or 32 byte object.

gcc6 and earlier inlined lock cmpxchg16b when you compile with -mcx16. (cmpxchg16b unfortunately isn't baseline for x86-64; first-gen AMD K8 CPUs are missing it.)

gcc7 decided to always call libatomic and never report 16-byte objects as lock-free, even though libatomic functions would still use lock cmpxchg16b on machines where the instruction is available. See "is_lock_free() returned false after upgrading to MacPorts gcc 7.3". A gcc mailing list message explains this change.

You can use a union hack to get a reasonably cheap ABA pointer+counter on x86-64 with gcc/clang: see "How can I implement ABA counter with c++11 CAS?". It uses lock cmpxchg16b for updates of both pointer and counter, but simple mov loads of just the pointer. This only works if the 16-byte object is actually lock-free using lock cmpxchg16b, though.


Footnote 1: movdqa 16-byte load/store is atomic in practice on some (but not all) x86 microarchitectures, and there's no reliable or documented way to detect when it's usable. See "Why is integer assignment on a naturally aligned variable atomic on x86?", and "SSE instructions: which CPUs can do atomic 16B memory operations?" for an example where K10 Opteron shows tearing at 8B boundaries only between sockets with HyperTransport.

So compiler writers have to err on the side of caution and can't use movdqa the way they use SSE2 movq for 8-byte atomic load/store in 32-bit code. It would be great if CPU vendors could document some guarantees for some microarchitectures, or add CPUID feature bits for atomic 16, 32, and 64-byte aligned vector load/store (with SSE, AVX, and AVX512). Ideally these would be bits that motherboard vendors could disable in firmware on funky many-socket machines that use special coherency glue chips that don't transfer whole cache lines atomically.

The easiest way to answer such questions is generally to just look at the resulting assembly and take it from there.

Compiling the following (I made your struct larger to dodge crafty compiler shenanigans):

#include <atomic>

struct foo {
    double a;
    double b;
    double c;
    double d;
    double e;
};

std::atomic<foo> var;

void bar()
{
    var.store(foo{1.0,2.0,1.0,2.0,1.0});
}

With clang 5.0.0, this yields the following under -O3 (see on godbolt):

bar(): # @bar()
  sub rsp, 40
  movaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [1.000000e+00,2.000000e+00]
  movaps xmmword ptr [rsp], xmm0
  movaps xmmword ptr [rsp + 16], xmm0
  movabs rax, 4607182418800017408
  mov qword ptr [rsp + 32], rax
  mov rdx, rsp
  mov edi, 40
  mov esi, var
  mov ecx, 5
  call __atomic_store

Great, the compiler delegates to an intrinsic (__atomic_store), which doesn't tell us what's really going on here. However, since the compiler is open source, we can easily find the implementation of the intrinsic (I found it in https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/atomic.c):

void __atomic_store_c(int size, void *dest, void *src, int model) {
#define LOCK_FREE_ACTION(type) \
    __c11_atomic_store((_Atomic(type)*)dest, *(type*)dest, model);\
    return;
  LOCK_FREE_CASES();
#undef LOCK_FREE_ACTION
  Lock *l = lock_for_pointer(dest);
  lock(l);
  memcpy(dest, src, size);
  unlock(l);
}

It seems like the magic happens in lock_for_pointer(), so let's have a look at it:

static __inline Lock *lock_for_pointer(void *ptr) {
  intptr_t hash = (intptr_t)ptr;
  // Disregard the lowest 4 bits.  We want all values that may be part of the
  // same memory operation to hash to the same value and therefore use the same
  // lock.  
  hash >>= 4;
  // Use the next bits as the basis for the hash
  intptr_t low = hash & SPINLOCK_MASK;
  // Now use the high(er) set of bits to perturb the hash, so that we don't
  // get collisions from atomic fields in a single object
  hash >>= 16;
  hash ^= low;
  // Return a pointer to the word to use
  return locks + (hash & SPINLOCK_MASK);
}

And here's our explanation: the address of the atomic is used to generate a hash key that selects a pre-allocated lock.
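To see what the extra XOR step buys, the index computation from lock_for_pointer() can be replayed on plain integers and compared with a low-bits-only hash. The sketch below is illustrative only: it assumes a table of 1024 locks (the value of SPINLOCK_MASK is not shown in the excerpt), the names kLockCount, kLockMask, low_bits_index and xor_index are hypothetical, and the two sample addresses are made up.

```cpp
#include <cstdint>

// Assumed table size (power of 2); the real SPINLOCK_MASK may differ.
constexpr std::uintptr_t kLockCount = 1u << 10;
constexpr std::uintptr_t kLockMask = kLockCount - 1;

// Simplest scheme: drop the low 4 bits, then index with the next bits.
constexpr std::uintptr_t low_bits_index(std::uintptr_t addr) {
    return (addr >> 4) & kLockMask;
}

// Same arithmetic as the quoted lock_for_pointer(), as one expression:
//   (addr >> 4) & mask : the "low" bits used as the basis of the hash
//   (addr >> 4) >> 16  : higher bits, XORed in to perturb the index
constexpr std::uintptr_t xor_index(std::uintptr_t addr) {
    return (((addr >> 4) >> 16) ^ ((addr >> 4) & kLockMask)) & kLockMask;
}

// Two made-up addresses exactly 1 MiB apart.
constexpr std::uintptr_t kA = 0x10000040;
constexpr std::uintptr_t kB = kA + (1u << 20);

static_assert(low_bits_index(kA) == low_bits_index(kB),
              "low-bits hash: objects 1 MiB apart share a lock");
static_assert(xor_index(kA) != xor_index(kB),
              "XOR hash: the same two objects get different locks");
static_assert(xor_index(kA) == xor_index(kA + 8),
              "bytes within one 16-byte block always share a lock");
```

Under these assumptions, the two addresses alias with the plain low-bits hash but land on different locks once the higher bits are mixed in, while both halves of a 16-byte object still map to the same lock.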

From 29.5.9 of the C++ standard:

Note: The representation of an atomic specialization need not have the same size as its corresponding argument type. Specializations should have the same size whenever possible, as this reduces the effort required to port existing code. (end note)

It is preferable, although not necessary, to make the size of an atomic the same as the size of its argument type. The way to achieve this is by either avoiding locks or storing the locks in a separate structure. As the other answers have already explained clearly, a hash table is used to hold all the locks. This is the most memory-efficient way of storing locks for any number of atomic objects in use.
