
Why is CompareAndSwap instruction considered expensive?

I read in a book:

"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."

Why is that? Thanks!

"CAS isn't appreciably different than a normal store. Some of the misinformation regarding CAS probably arises from the original implementation of lock:cmpxchg (CAS) on Intel processors. The lock: prefix caused the LOCK# signal to be asserted, acquiring exclusive access to the bus. This didn't scale of course. Subsequent implementations of lock:cmpxchg leverage cache coherency protocol -- typically snoop-based MESI -- and don't assert LOCK#." “CAS与正常商店的区别并不明显。有关CAS的一些错误信息可能源于英特尔处理器上最初的lock:cmpxchg(CAS)实现.lock:前缀导致LOCK#信号被置位,获得独占权访问总线。这当然没有扩展。随后的锁实现:cmpxchg利用缓存一致性协议 - 通常是基于snoop的MESI - 并且不会断言LOCK#。 - David Dice, Biased locking in HotSpot - David Dice, HotSpot中的偏向锁定

"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction." “内存障碍很昂贵,与原子compareAndSet()指令一样昂贵。”

This is quite true.
E.g. on x86, a proper CAS on a multi-processor system has a lock prefix.
The lock prefix results in a full memory barrier:

"... locked operations serialize all outstanding load and store operations (that is, wait for them to complete). ... Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer's Manual, Section 8.1.2

A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64.
On x86, CAS results in a full memory barrier.
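To illustrate the point about the JITs, here is a minimal Java sketch (Java 9+ for VarHandle.fullFence()). Both lines below imply a full barrier on x86; the exact instruction the JIT emits for the standalone fence (a dummy locked instruction, an mfence, etc.) depends on the JVM version and the CPU:

import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicInteger;

public class FenceDemo {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger();

        counter.compareAndSet(0, 1); // lock cmpxchg: atomic update plus full fence
        VarHandle.fullFence();       // standalone full fence, no data update at all
    }
}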

On PPC, it is different. An LL/SC pair - lwarx & stwcx - can be used to load the memory operand into a register, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted.
It also does not mean an automatic full fence.
Performance characteristics and behaviour can be very different on different architectures.
But then again - a weak LL/SC is not CAS.
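Java exposes this distinction: AtomicInteger.weakCompareAndSet() is allowed to fail spuriously (much as an LL/SC pair can be broken by an interrupt or a cache event) and gives weaker ordering guarantees than compareAndSet(), so it only makes sense inside a retry loop. A minimal sketch of an increment built that way (the helper incrementAndGet here is purely illustrative; AtomicInteger already provides one):

import java.util.concurrent.atomic.AtomicInteger;

public class WeakCasIncrement {
    // CAS retry loop: on LL/SC machines the "store" half may be rejected
    // even when the value still matches, so the loop simply tries again.
    static int incrementAndGet(AtomicInteger a) {
        int current;
        do {
            current = a.get();                                 // load the current value
        } while (!a.weakCompareAndSet(current, current + 1));  // may fail spuriously; retry
        return current + 1;
    }

    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger(41);
        System.out.println(incrementAndGet(a)); // 42
    }
}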

That's because they introduce extra overhead for making the operation atomic. The underlying platform has to forgo some optimizations (such as working purely out of a local cache) and coordinate with the other cores to enforce the barrier, and that takes extra work. While that extra activity is in progress the thread cannot proceed, so the overall program incurs a time delay.

"expensive" is very relative here. “昂贵”在这里非常相对。 It's absolutely insignificant compared with, say, a harddisk access. 与硬盘访问相比,这绝对是微不足道的。 But RAM bus speed has not kept up with the speed of modern CPUs, and compared with arithmetic operations inside the CPU, accessing the RAM directly (ie non-cached) is quite expensive. 但是RAM总线速度跟不上现代CPU的速度,并且与CPU内部的算术运算相比,直接访问RAM(即非缓存)非常昂贵。 It can easily take 50 times as long to fetch an int from RAM than to add two registers. 从RAM中获取int可以轻松地花费50倍的时间而不是添加两个寄存器。

So, since memory barriers basically force direct RAM access (possibly for multiple CPUs), they are relatively expensive.
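A very rough way to see this relative cost is to time a plain increment loop against an AtomicInteger loop. This is not a rigorous benchmark (JIT warm-up and loop optimizations distort it; a proper measurement would use JMH), just an order-of-magnitude illustration:

import java.util.concurrent.atomic.AtomicInteger;

public class RoughCost {
    public static void main(String[] args) {
        final int N = 100_000_000;

        int plain = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < N; i++) plain++;                   // register/cache arithmetic
        long plainNs = System.nanoTime() - t0;

        AtomicInteger atomic = new AtomicInteger();
        long t1 = System.nanoTime();
        for (int i = 0; i < N; i++) atomic.incrementAndGet();  // lock-prefixed read-modify-write
        long atomicNs = System.nanoTime() - t1;

        // Print the results so the JIT cannot discard the loops entirely.
        System.out.println("plain:  " + plainNs + " ns (result " + plain + ")");
        System.out.println("atomic: " + atomicNs + " ns (result " + atomic.get() + ")");
    }
}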

I think I found the answer in my book:

Each getAndSet() is broadcast to the bus. Because all threads must use the bus to communicate with memory, these getAndSet() calls delay all threads (cores), even those not waiting for the lock.

Even worse, the getAndSet() call forces other processors to discard their own cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and must use the bus to fetch the new, but unchanged, value.
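The context in the book is a spin lock built directly on getAndSet() (test-and-set). The two sketches below roughly follow the book's TASLock and TTASLock examples: the first spins on getAndSet() and generates coherency traffic on every iteration, while the second spins on a plain read that can be served from the local cache and only attempts getAndSet() when the lock looks free:

import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-set lock: every spin iteration is an atomic read-modify-write,
// which invalidates the other cores' cached copies of `state`.
class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (state.getAndSet(true)) {
            // spin
        }
    }

    public void unlock() {
        state.set(false);
    }
}

// Test-and-test-and-set lock: spin on a cached read, and only fall back to
// getAndSet() when the lock appears to be free. Far less bus traffic.
class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (state.get()) {
                // spin locally
            }
            if (!state.getAndSet(true)) {
                return;
            }
        }
    }

    public void unlock() {
        state.set(false);
    }
}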

In general, atomic operations are expensive because they require cross-CPU synchronization. A "normal" operation is allowed to operate on cached data, allowing extra speed. Take, for example, a two-CPU system:

Thread 1

while (1) x++;

Thread 2

while (1) x++;

Because increment is not an atomic operation or protected by a memory barrier, the results of this are pretty much undefined. You don't know how x will be incremented, or it could even get corrupted.

Thread 1

while (1) atomicIncrement(&x);

Thread 2

while (1) atomicIncrement(&x);

Now, you are trying to get well-defined behavior - no matter the ordering, x must increment one by one. If the two threads are running on different CPUs, they have to either reduce the amount of allowed caching or otherwise "compare notes" to make sure that something sensible happens.

This extra overhead can be quite expensive, and it's the general cause of the claim that atomic operations are slow.
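A runnable Java version of this two-thread example (using AtomicInteger in place of the pseudocode atomicIncrement) shows both the lost updates of the plain increment and the well-defined result of the atomic one:

import java.util.concurrent.atomic.AtomicInteger;

public class IncrementRace {
    static int plain = 0;                                    // unsynchronized counter
    static final AtomicInteger atomic = new AtomicInteger(); // atomic counter

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                plain++;                  // non-atomic read-modify-write: updates can be lost
                atomic.incrementAndGet(); // atomic read-modify-write: never loses an update
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();

        // `plain` usually ends up well below 2,000,000; `atomic` is always exactly 2,000,000.
        System.out.println("plain  = " + plain);
        System.out.println("atomic = " + atomic.get());
    }
}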
