
Memory order for a ticket-taking spin-lock mutex

Suppose I have the following ticket-taking spinlock mutex implementation (in C, using GCC atomic builtins). As I understand it, the use of the "release" memory order in the unlock function is correct. I'm unsure, though, about the lock function. Because this is a ticket-taking mutex, there's a field indicating the next ticket number to be handed out, and a field indicating which ticket number currently holds the lock. I've used acquire-release on the ticket increment and acquire on the spin load. Is that unnecessarily strong, and if so, why?

Separately, should those two fields (ticket and serving) be spaced so that they're on different cache lines, or does that not matter? I'm mainly interested in arm64 and amd64.

typedef struct {
        u64 ticket;
        u64 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
        u64 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_ACQ_REL);
        while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE));
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
        (void) __atomic_fetch_add(&m->serving, 1, __ATOMIC_RELEASE);
}

UPDATE: based on the advice in the accepted answer, I've adjusted the implementation to the following. This mutex is intended for the low-contention case.

typedef struct {
        u32 ticket;
        u32 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
        u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
        while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
                #ifdef __x86_64__
                __asm __volatile ("pause");
                #endif
        }
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
        u32 my_ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
        (void) __atomic_store_n(&m->serving, my_ticket+1, __ATOMIC_RELEASE);
}

The m->ticket increment only needs to be RELAXED. You only need each thread to get a different ticket number; it can happen as early or late as you want with respect to other operations in the same thread.

load(&m->serving, acquire) is the operation that orders the critical section, preventing its operations from starting until we've synchronized-with the RELEASE operation in the unlock function of the previous holder of the lock. So the m->serving load needs to be at least acquire.

Even if the m->ticket++ doesn't complete until after an acquire load of m->serving, that's fine. The while condition still determines whether execution proceeds (non-speculatively) into the critical section. Speculative execution into the critical section is fine, and actually good, since it probably means the work is ready to commit sooner, reducing the time the lock is held.

Extra ordering on the RMW operation won't make it any faster locally or in terms of inter-thread visibility, and it would slow down the thread taking the lock.


One cache line or two

For performance, I think that with high contention there are advantages to keeping the members in separate cache lines.

Threads needing exclusive ownership of the cache line to get a ticket number won't contend with the thread unlocking .serving, so those inter-thread latency delays can happen in parallel.
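
If you do go with that high-contention layout, a minimal sketch of the padding is below; the 64-byte line size and the GCC aligned attribute are assumptions on my part, and u32 stands in for the questioner's 32-bit typedef:

// Hypothetical padded variant: each field gets its own (assumed 64-byte)
// cache line, so ticket-taking RMWs don't contend with the unlocker's store.
typedef struct {
        u32 ticket  __attribute__((aligned(64)));
        u32 serving __attribute__((aligned(64)));
} ticket_mutex_padded;          /* sizeof == 128 on typical targets */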

With multiple cores in the spin-wait while(load(serving)) loop, they can hit in their local L1d cache until something invalidates shared copies of the line, not creating any extra traffic. But that wastes a lot of power unless you use something like x86 _mm_pause(), as well as wasting execution resources that could be shared with another logical core on the same physical core. x86 pause also avoids a branch mispredict when leaving the spin loop.

Exponential backoff, up to some number of pauses between checks, is a common recommendation, but here we can do better: use a number of pause instructions between checks that scales with my_ticket - m->serving, so you check more often when your ticket is coming up.
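
A rough sketch of that proportional backoff (the scale factor of 4 is an arbitrary assumption; the pause is emitted the same way as in the questioner's updated code):

void
ticket_mutex_lock(ticket_mutex *m)
{
        u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
        for (;;) {
                u32 serving = __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE);
                if (serving == my_ticket)
                        return;
                // Pause longer the further back in the queue we are; unsigned
                // wrap-around subtraction gives the right distance either way.
                u32 distance = my_ticket - serving;
                for (u32 i = 0; i < distance * 4; i++) {
                        #ifdef __x86_64__
                        __asm __volatile ("pause");
                        #endif
                }
        }
}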

In really high-contention cases, falling back to OS-assisted sleep/wake is appropriate if you'll be waiting for long, like Linux futex. Or, since we can see how close to the head of the queue we are: yield, nanosleep, or futex if your wait will be more than 3 or 8 ticket numbers, or whatever. (Tunable depending on how long it takes to serve a ticket.)

(Using futex, you might introduce a read of m->ticket into the unlock to figure out whether there might be any threads sleeping, waiting for a notify, like C++20 atomic<>.wait() and atomic.notify_all(). Unfortunately I don't know a good way to figure out which thread to notify, instead of waking them all up to check if they're the lucky winner.)
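
For illustration, a rough sketch of that futex fallback on Linux; the distance threshold, the unconditional wake-all, and the raw syscall wrapper are simplifying assumptions (a real version would avoid the wake syscall when nobody can be sleeping, e.g. via the m->ticket read mentioned above):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>

#define SPIN_DISTANCE 4         /* hypothetical threshold: spin if this close */

void
ticket_mutex_lock(ticket_mutex *m)
{
        u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
        for (;;) {
                u32 serving = __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE);
                if (serving == my_ticket)
                        return;
                if (my_ticket - serving > SPIN_DISTANCE) {
                        // Sleep until m->serving changes from the value we saw;
                        // the kernel re-checks it, so a concurrent unlock can't
                        // cause a lost wakeup.
                        syscall(SYS_futex, &m->serving, FUTEX_WAIT_PRIVATE,
                                serving, NULL, NULL, 0);
                }
                // otherwise just re-check (optionally with pause instructions)
        }
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
        u32 serving = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
        __atomic_store_n(&m->serving, serving + 1, __ATOMIC_RELEASE);
        // Wake all waiters; we can't tell which one holds the next ticket.
        syscall(SYS_futex, &m->serving, FUTEX_WAKE_PRIVATE, INT_MAX,
                NULL, NULL, 0);
}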


With low average contention, you should keep both in the same cache line. An access to .ticket is always immediately followed by a load of .serving. In the unlocked no-contention case, this means only one cache line is bouncing around, or having to stay hot, for the same core to take/release the lock.

If the lock is already held, the thread wanting to unlock needs exclusive ownership of the cache line to RMW or store. It loses this whether another core does an RMW or just a pure load on the line containing .serving.

There won't be too many cases where multiple waiters are all spinning on the same lock, and where new threads getting a ticket number delay the unlock, and its visibility to the thread waiting for it.

This is my intuition, anyway; it's probably hard to microbenchmark, unless a cache-miss atomic RMW stops the later load from even starting to request the later line, in which case you could have two cache-miss latencies in taking the lock.


Avoiding an atomic RMW in the unlock?

The thread holding the lock knows it has exclusive ownership; no other thread will be modifying m->serving concurrently. If you had the lock owner remember its own ticket number, you could optimize the unlock to just a store.

void ticket_mutex_unlock(ticket_mutex *m, uint32_t ticket_num)
{
        (void) __atomic_store_n(&m->serving, ticket_num+1, __ATOMIC_RELEASE);
}
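
A matching lock function for that API could simply return the ticket it took; aside from the return value, this sketch is just the questioner's updated lock:

uint32_t
ticket_mutex_lock(ticket_mutex *m)
{
        uint32_t my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
        while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
                #ifdef __x86_64__
                __asm __volatile ("pause");
                #endif
        }
        return my_ticket;       /* pass this back to ticket_mutex_unlock */
}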

Or without that API change (to return an integer from u32 ticket_mutex_lock()):

void ticket_mutex_unlock(ticket_mutex *m)
{
        uint32_t ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);  // we already own the lock
        // and no other thread can be writing concurrently, so a non-atomic increment is safe
        (void) __atomic_store_n(&m->serving, ticket+1, __ATOMIC_RELEASE);
}

This has a nice efficiency advantage on ISAs that need LL/SC retry loops for atomic RMWs, where spurious failure from another core reading the value can happen. And on x86, where the only possible atomic RMW is a full barrier, stronger even than needed for C seq_cst semantics.

BTW, the lock fields would be fine as uint32_t. You're not going to have 2^32 threads waiting for a lock, so I used uint32_t instead of u64. Wrap-around is well-defined. Even subtraction like ticket - serving Just Works across that wrapping boundary: in 32-bit unsigned math, 1 - 0xffffffff gives 2, so you can still calculate how close you are to being served, for sleep decisions.
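
A tiny self-contained illustration of that wrap-around distance calculation, with values chosen just to straddle the boundary:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t my_ticket = 1;            /* our ticket, wrapped past zero  */
        uint32_t serving   = 0xffffffffu;  /* ticket currently being served  */
        uint32_t distance  = my_ticket - serving;  /* 2, despite the wrap    */
        printf("%u unlocks until our turn\n", distance);
        return 0;
}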

This is not a big deal on x86-64, only saving a bit of code size, and probably not a factor at all on AArch64. But it will help significantly on some 32-bit ISAs.
