Lock-free SPSC queue implementation on ARM

Question

I'm trying to write a single-producer, single-consumer queue for ARM, and I think I'm close to wrapping my head around DMB, but I need some checking (I'm more familiar with std::atomic).

Here's where I'm at:

bool push(const_reference value)
{
    // Check for room
    const size_type currentTail = tail;
    const size_type nextTail = increment(currentTail);
    if (nextTail == head)
        return false;

    // Write the value
    valueArr[currentTail] = value;

    // Prevent the consumer from seeing the incremented tail before the
    // value is written.
    __DMB();

    // Increment tail
    tail = nextTail;

    return true;
}

bool pop(reference valueLocation)
{
    // Check for data
    const size_type currentHead = head;
    if (currentHead == tail)
        return false;

    // Write the value.
    valueLocation = valueArr[currentHead];

    // Prevent the producer from seeing the incremented head before the
    // value is written.
    __DMB();

    // Increment the head
    head = increment(head);

    return true;
}

My question is: are my DMB placement and justification accurate? Or is there still some understanding that I'm missing? I'm particularly uncertain about whether the conditionals need some guard when dealing with the variable that's updated by the other thread (or interrupt).

Answer 1

A barrier there is necessary but not sufficient; you also need "acquire" semantics for loading the variable modified by the other thread. (Or at least consume, but getting that without a barrier would require asm to create a data dependency. A compiler wouldn't do that after already having a control dependency.)
A single-core system can use just a compiler barrier like GNU C asm("":::"memory") or std::atomic_signal_fence(std::memory_order_release), not dmb. Make a macro so you can choose between SMP-safe barriers and UP (uniprocessor) barriers.
head = increment(head); is a pointless reload of head; use the local copy, currentHead.
Use std::atomic to get the necessary code-gen portably.

You normally don't need to roll your own atomics; modern compilers for ARM do implement std::atomic<T>. But AFAIK, no std::atomic<> implementation is aware of single-core systems, where it could avoid actual barrier instructions and just be safe with respect to interrupts that can cause a context switch.

On a single-core system, you don't need dmb, just a compiler barrier. The CPU will preserve the illusion of asm instructions executing sequentially, in program order. You just need to make sure the compiler generates asm that does things in the right order. You can do that by using std::atomic with std::memory_order_relaxed, and manual atomic_signal_fence(memory_order_acquire) or release barriers. (Not atomic_thread_fence; that would emit asm instructions, typically dmb.)
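As a rough illustration of the compiler-barrier-only approach, here is a minimal sketch that uses a C signal handler as a stand-in for an interrupt running on the same core. All names (payload, ready, observed, on_signal, publish, run_demo) are hypothetical and not from the original post, and the sketch assumes std::atomic<bool> and std::atomic<int> are lock-free on the target.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical names for illustration only.
int payload = 0;                      // plain data handed to the "interrupt"
std::atomic<bool> ready{false};       // flag published after the payload
std::atomic<int> observed{-1};        // what the handler saw

// Stand-in for an interrupt handler on the same core as the writer.
extern "C" void on_signal(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Compiler-only acquire fence: the payload read below cannot be
        // hoisted above the flag load. No barrier instruction is emitted.
        std::atomic_signal_fence(std::memory_order_acquire);
        observed.store(payload, std::memory_order_relaxed);
    }
}

// Writer side: store the payload, then publish the flag.
void publish(int value) {
    payload = value;
    // Compiler-only release fence: the payload store cannot sink below
    // the flag store. No barrier instruction is emitted.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Runs the writer, then delivers the signal synchronously on this thread.
int run_demo(int value) {
    std::signal(SIGINT, on_signal);
    publish(value);
    std::raise(SIGINT);               // the handler runs before raise() returns
    std::signal(SIGINT, SIG_DFL);
    return observed.load(std::memory_order_relaxed);
}
```

Calling run_demo(42) should hand 42 to the handler. This ordering is only enough when writer and handler share a core; across cores the fences would have to become atomic_thread_fence or the loads and stores would need acquire/release ordering.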

Each thread reads a variable that the other thread modifies. You're correctly making the modifications release-stores by making sure they become visible only after the access to the array.

But those reads also need to be acquire-loads to synchronize with those release-stores. E.g. to make sure push isn't writing valueArr[currentTail] = value; before pop finishes reading that same element, or that pop isn't reading an entry before it's fully written.

Without any barrier, the failure mode would be that if (currentHead == tail) return false; doesn't actually check the value of tail from memory until after valueLocation = valueArr[currentHead]; happens. Runtime load reordering can easily do that on weakly-ordered ARM. If the load address had a data dependency on tail, that could avoid needing a barrier there on an SMP system (ARM guarantees dependency ordering in asm; the feature that mo_consume was supposed to expose). But if the compiler just emits a branch, that's only a control dependency, not a data dependency. If you were writing by hand in asm, a predicated load like ldrne r0, [r1, r2] on flags set by the compare would, I think, create a data dependency.

Compile-time reordering is less plausible, but a compiler-only barrier is free if it's only stopping the compiler from doing something it wasn't going to do anyway.
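The release-store / acquire-load pairing discussed above can be reduced to a one-slot hand-off. The following is a minimal sketch with hypothetical names (Mailbox, publish, try_consume), not code from the question; it uses std::atomic orderings directly, so it is the SMP-safe form.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical one-slot mailbox; names are illustrative only.
struct Mailbox {
    int data = 0;                     // plain payload
    std::atomic<bool> full{false};    // written by the producer, read by the consumer

    // Producer: write the payload, then publish it with a release store,
    // so the payload write is visible before the flag is.
    void publish(int value) {
        data = value;
        full.store(true, std::memory_order_release);
    }

    // Consumer: the acquire load keeps the read of `data` from being
    // performed before the flag check, at compile time and at run time.
    bool try_consume(int &out) {
        if (!full.load(std::memory_order_acquire))
            return false;
        out = data;
        return true;
    }
};
```

With a relaxed load in try_consume instead, a weakly-ordered CPU would be allowed to read data before the flag, which is the same failure mode as reading valueArr[currentHead] before checking tail.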

Untested implementation: it compiles to asm that looks OK, but has had no other testing.

Do something similar for pop. I included wrapper functions for load-acquire / store-release, and fullbarrier(). (Equivalent to the Linux kernel's smp_mb() macro, defined as a compile-time or compile+runtime barrier.)

#include <atomic>

#define UNIPROCESSOR   // or drop this line and build with -DUNIPROCESSOR


#ifdef UNIPROCESSOR
#define fullbarrier()  asm("":::"memory")   // GNU C compiler barrier
                          // atomic_signal_fence(std::memory_order_seq_cst)
#else
#define fullbarrier() __DMB()    // or atomic_thread_fence(std::memory_order_seq_cst)
#endif

template <class T>
T load_acquire(std::atomic<T> &x) {
#ifdef UNIPROCESSOR
    T tmp = x.load(std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acquire);
    // or fullbarrier();  if you want to use that macro
    return tmp;
#else
    return x.load(std::memory_order_acquire);
    // fullbarrier() / __DMB();
#endif
}

template <class T>
void store_release(std::atomic<T> &x, T val) {
#ifdef UNIPROCESSOR
    std::atomic_signal_fence(std::memory_order_release);
    // or fullbarrier();
    x.store(val, std::memory_order_relaxed);
#else
    // fullbarrier() / __DMB(); before plain store
    x.store(val, std::memory_order_release);
#endif
}

template <class T>
struct SPSC_queue {
  using size_type = unsigned;
  using value_type = T;
  static const size_type size = 1024;  // power of 2, so the free-running counters wrap cleanly

  std::atomic<size_type> head{0};
  value_type valueArr[size];
  std::atomic<size_type> tail{0};  // in a separate cache-line from head to reduce contention

  bool push(const value_type &value)
  {
    // Check for room
    const size_type currentTail = tail.load(std::memory_order_relaxed);  // no other writers to tail, no ordering needed
    const size_type nextTail = currentTail + 1;    // free-running counter; modulo separately so empty and full are distinguishable.
    if (currentTail - load_acquire(head) == size)  // full when the counters are a whole buffer apart
        return false;

    valueArr[currentTail % size] = value;
    store_release(tail, nextTail);
    return true;
  }
};

// instantiate the template for  int  so we can look at the asm
template bool SPSC_queue<int>::push(const value_type &value);

Compiles cleanly on the Godbolt compiler explorer with zero barriers if you use -DUNIPROCESSOR, with g++9.2 -O3 -mcpu=cortex-a15 (just to pick a random modern-ish ARM core so GCC can inline the std::atomic load/store functions and barriers for the non-uniprocessor case).
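For the other half of the queue, a pop along the same lines might look like the sketch below. It is not the answer's code: it uses std::atomic acquire/release directly (the SMP variant, without the UNIPROCESSOR wrappers), the names SpscQueueSketch and Size are hypothetical, and it assumes free-running counters with a power-of-2 capacity, so "full" means the counters are exactly Size apart. Like the code above, it has only been checked single-threaded.

```cpp
#include <atomic>
#include <cassert>

// Sketch only; names are illustrative, not from the original answer.
template <class T, unsigned Size = 1024>
struct SpscQueueSketch {
    static_assert((Size & (Size - 1)) == 0, "Size must be a power of 2");
    using size_type = unsigned;

    std::atomic<size_type> head{0};   // advanced only by the consumer
    T valueArr[Size];
    std::atomic<size_type> tail{0};   // advanced only by the producer

    bool push(const T &value) {
        const size_type currentTail = tail.load(std::memory_order_relaxed);
        // Acquire pairs with the release store in pop(): a slot is not
        // overwritten until the consumer has finished reading it.
        if (currentTail - head.load(std::memory_order_acquire) == Size)
            return false;                          // full
        valueArr[currentTail % Size] = value;
        tail.store(currentTail + 1, std::memory_order_release);
        return true;
    }

    bool pop(T &valueLocation) {
        const size_type currentHead = head.load(std::memory_order_relaxed);
        // Acquire pairs with the release store in push(): the element is
        // fully written before it is read below.
        if (currentHead == tail.load(std::memory_order_acquire))
            return false;                          // empty
        valueLocation = valueArr[currentHead % Size];
        // Release: the read above completes before the slot is handed back.
        // Uses the local copy rather than reloading head.
        head.store(currentHead + 1, std::memory_order_release);
        return true;
    }
};
```

A quick single-threaded check is to fill a SpscQueueSketch<int, 4>, confirm that a fifth push fails, then pop the values back in FIFO order. That exercises the index arithmetic only; it says nothing about the memory ordering, which would need a two-thread stress test on real ARM hardware.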

Answer 1: 2 votes, accepted, 2020-05-30 01:47:25
