Lock Free stack implementation idea - currently broken

I came up with an idea I am trying to implement for a lock-free stack that does not rely on reference counting to resolve the ABA problem, and also handles memory reclamation properly. It is similar in concept to RCU, and relies on two features: marking a list entry as removed, and tracking readers traversing the list. The former is simple: it just uses the LSB of the pointer. The latter is my "clever" attempt at an approach to implementing an unbounded lock-free stack.

Basically, when any thread attempts to traverse the list, one atomic counter (list.entries) is incremented. When the traversal is complete, a second counter (list.exits) is incremented.

Node allocation is handled by push, and deallocation is handled by pop.

The push and pop operations are fairly similar to the naive lock-free stack implementation, but the nodes marked for removal must be traversed to arrive at a non-marked entry. Push is therefore essentially a linked-list insertion.

The pop operation similarly traverses the list, but it uses atomic_fetch_or to mark the nodes as removed while traversing, until it reaches a non-marked node.

After traversing the list of 0 or more marked nodes, a thread that is popping will attempt to CAS the head of the stack. At least one thread concurrently popping will succeed, and after this point all readers entering the stack will no longer see the formerly marked nodes.

The thread that successfully updates the list then loads the atomic list.entries, and basically spin-loads atomic.exits until that counter finally exceeds list.entries. This should imply that all readers of the "old" version of the list have completed. The thread then simply frees the list of marked nodes that it swapped off the top of the list.

So the implications from the pop operation should be (I think) that there can be no ABA problem, because the nodes that are freed are not returned to the usable pool of pointers until all concurrent readers using them have completed, and obviously the memory reclamation issue is handled as well, for the same reason.

So anyhow, that is the theory, but I'm still scratching my head over the implementation, because it is currently not working (in the multithreaded case). It seems like I am getting some write-after-free issues among other things, but I'm having trouble spotting the problem, or maybe my assumptions are flawed and it just won't work.

Any insights would be greatly appreciated, both on the concept, and on approaches to debugging the code.

Here is my current (broken) code (compile with gcc -D_GNU_SOURCE -std=c11 -Wall -O0 -g -pthread -o list list.c):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#include <sys/resource.h>

#include <stdio.h>
#include <unistd.h>

#define NUM_THREADS 8
#define NUM_OPS (1024 * 1024)

typedef uint64_t list_data_t;

typedef struct list_node_t {
    struct list_node_t * _Atomic next;
    list_data_t data;
} list_node_t;

typedef struct {
    list_node_t * _Atomic head;
    int64_t _Atomic size;
    uint64_t _Atomic entries;
    uint64_t _Atomic exits;
} list_t;

enum {
    NODE_IDLE    = (0x0),
    NODE_REMOVED = (0x1 << 0),
    NODE_FREED   = (0x1 << 1),
    NODE_FLAGS    = (0x3),
};

static __thread struct {
    uint64_t add_count;
    uint64_t remove_count;
    uint64_t added;
    uint64_t removed;
    uint64_t mallocd;
    uint64_t freed;
} stats;

#define NODE_IS_SET(p, f) (((uintptr_t)p & f) == f)
#define NODE_SET_FLAG(p, f) ((void *)((uintptr_t)p | f))
#define NODE_CLR_FLAG(p, f) ((void *)((uintptr_t)p & ~f))
#define NODE_POINTER(p) ((void *)((uintptr_t)p & ~NODE_FLAGS))

list_node_t * list_node_new(list_data_t data)
{
    list_node_t * new = malloc(sizeof(*new));
    new->data = data;
    stats.mallocd++;

    return new;
}

void list_node_free(list_node_t * node)
{
    free(node);
    stats.freed++;
}

static void list_add(list_t * list, list_data_t data)
{
    atomic_fetch_add_explicit(&list->entries, 1, memory_order_seq_cst);

    list_node_t * new = list_node_new(data);
    list_node_t * _Atomic * next = &list->head;
    list_node_t * current = atomic_load_explicit(next,  memory_order_seq_cst);
    do
    {
        stats.add_count++;
        while ((NODE_POINTER(current) != NULL) &&
                NODE_IS_SET(current, NODE_REMOVED))
        {
                stats.add_count++;
                current = NODE_POINTER(current);
                next = &current->next;
                current = atomic_load_explicit(next, memory_order_seq_cst);
        }
        atomic_store_explicit(&new->next, current, memory_order_seq_cst);
    }
    while(!atomic_compare_exchange_weak_explicit(
            next, &current, new,
            memory_order_seq_cst, memory_order_seq_cst));

    atomic_fetch_add_explicit(&list->exits, 1, memory_order_seq_cst);
    atomic_fetch_add_explicit(&list->size, 1, memory_order_seq_cst);
    stats.added++;
}

static bool list_remove(list_t * list, list_data_t * pData)
{
    uint64_t entries = atomic_fetch_add_explicit(
            &list->entries, 1, memory_order_seq_cst);

    list_node_t * start = atomic_fetch_or_explicit(
            &list->head, NODE_REMOVED, memory_order_seq_cst);
    list_node_t * current = start;

    stats.remove_count++;
    while ((NODE_POINTER(current) != NULL) &&
            NODE_IS_SET(current, NODE_REMOVED))
    {
        stats.remove_count++;
        current = NODE_POINTER(current);
        current = atomic_fetch_or_explicit(&current->next,
                NODE_REMOVED, memory_order_seq_cst);
    }

    uint64_t exits = atomic_fetch_add_explicit(
            &list->exits, 1, memory_order_seq_cst) + 1;

    bool result = false;
    current = NODE_POINTER(current);
    if (current != NULL)
    {
        result = true;
        *pData = current->data;

        current = atomic_load_explicit(
                &current->next, memory_order_seq_cst);

        atomic_fetch_add_explicit(&list->size,
                -1, memory_order_seq_cst);

        stats.removed++;
    }

    start = NODE_SET_FLAG(start, NODE_REMOVED);
    if (atomic_compare_exchange_strong_explicit(
            &list->head, &start, current,
            memory_order_seq_cst, memory_order_seq_cst))
    {
        entries = atomic_load_explicit(&list->entries, memory_order_seq_cst);
        while ((int64_t)(entries - exits) > 0)
        {
            pthread_yield();
            exits = atomic_load_explicit(&list->exits, memory_order_seq_cst);
        }

        list_node_t * end = NODE_POINTER(current);
        list_node_t * current = NODE_POINTER(start);
        while (current != end)
        {
            list_node_t * tmp = current;
            current = atomic_load_explicit(&current->next, memory_order_seq_cst);
            list_node_free(tmp);
            current = NODE_POINTER(current);
        }
    }

    return result;
}

static list_t list;

pthread_mutex_t ioLock = PTHREAD_MUTEX_INITIALIZER;

void * thread_entry(void * arg)
{
    sleep(2);
    int id = *(int *)arg;

    for (int i = 0; i < NUM_OPS; i++)
    {
        bool insert = random() % 2;

        if (insert)
        {
            list_add(&list, i);
        }
        else
        {
            list_data_t data;
            list_remove(&list, &data);
        }
    }

    struct rusage u;
    getrusage(RUSAGE_THREAD, &u);

    pthread_mutex_lock(&ioLock);
    printf("Thread %d stats:\n", id);
    printf("\tadded = %lu\n", stats.added);
    printf("\tremoved = %lu\n", stats.removed);
    printf("\ttotal added = %ld\n", (int64_t)(stats.added - stats.removed));
    printf("\tadded count = %lu\n", stats.add_count);
    printf("\tremoved count = %lu\n", stats.remove_count);
    printf("\tadd average = %f\n", (float)stats.add_count / stats.added);
    printf("\tremove average = %f\n", (float)stats.remove_count / stats.removed);
    printf("\tmallocd = %lu\n", stats.mallocd);
    printf("\tfreed = %lu\n", stats.freed);
    printf("\ttotal mallocd = %ld\n", (int64_t)(stats.mallocd - stats.freed));
    printf("\tutime = %f\n", u.ru_utime.tv_sec
            + u.ru_utime.tv_usec / 1000000.0f);
    printf("\tstime = %f\n", u.ru_stime.tv_sec
                    + u.ru_stime.tv_usec / 1000000.0f);
    pthread_mutex_unlock(&ioLock);

    return NULL;
}

int main(int argc, char ** argv)
{
    struct {
            pthread_t thread;
            int id;
    }
    threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
    {
        threads[i].id = i;
        pthread_create(&threads[i].thread, NULL, thread_entry, &threads[i].id);
    }

    for (int i = 0; i < NUM_THREADS; i++)
    {
        pthread_join(threads[i].thread, NULL);
    }

    printf("Size = %ld\n", atomic_load(&list.size));

    uint32_t count = 0;

    list_data_t data;
    while(list_remove(&list, &data))
    {
        count++;
    }
    printf("Removed %u\n", count);
}

You mention you are trying to solve the ABA problem, but the description and code are actually an attempt to solve a harder problem: the memory reclamation problem.

This problem typically arises in the "deletion" functionality of lock-free collections implemented in languages without garbage collection. The core issue is that a thread removing a node from a shared structure often doesn't know when it is safe to free the removed node, because other readers may still have a reference to it. Solving this problem often, as a side effect, also solves the ABA problem, which is specifically about a CAS operation succeeding even though the underlying pointer (and the state of the object) has been changed at least twice in the meantime, ending up with the original value but representing a totally different state.

The ABA problem is easier in the sense that there are several straightforward solutions to it specifically that don't lead to a solution to the memory reclamation problem. It is also easier in the sense that hardware that can detect modification of the location, e.g., with LL/SC or transactional memory primitives, might not exhibit the problem at all.

So, that said, you are hunting for a solution to the memory reclamation problem, and such a solution will also avoid the ABA problem.

The core of your issue is this statement:

The thread that successfully updates the list then loads the atomic list.entries, and basically spin-loads atomic.exits until that counter finally exceeds list.entries. This should imply that all readers of the "old" version of the list have completed. The thread then simply frees the list of marked nodes that it swapped off the top of the list.

This logic doesn't hold. Waiting for list.exits (you say atomic.exits, but I think it's a typo, as you only talk about list.exits elsewhere) to be greater than list.entries only tells you there have now been more total exits than there were entries at the point the mutating thread captured the entry count. However, these exits may have been generated by new readers coming and going: it doesn't at all imply that all the old readers have finished, as you claim!

Here's a simple example. First a writing thread T1 and a reading thread T2 access the list around the same time, so list.entries is 2 and list.exits is 0. The writing thread pops a node, saves the current value (2) of list.entries, and waits for list.exits to be greater than 2. Now three more reading threads, T3, T4, T5, arrive, do a quick read of the list, and leave. Now list.exits is 3, your condition is met, and T1 frees the node. T2 hasn't gone anywhere, though, and blows up since it is reading a freed node!

The basic idea you have can work, but your two-counter approach in particular definitely doesn't.

This is a well-studied problem, so you don't have to invent your own algorithm (see the link above), or even write your own code, since things like librcu and concurrencykit already exist.

For Educational Purposes

If you wanted to make this work for educational purposes, though, one approach would be to ensure that threads coming in after a modification has started use a different set of list.entry/exit counters. One way to do this would be a generation counter: when the writer wants to modify the list, it increments the generation counter, which causes new readers to switch to a different set of list.entry/exit counters.

Now the writer just has to wait for list.entry[old] == list.exit[old], which means all the old readers have left. You could also get away with a single counter per generation: you don't really need two entry/exit counters (although having two might help reduce contention).

Of course, you now have a new problem of managing this list of separate counters per generation... which kind of looks like the original problem of building a lock-free list! This problem is a bit easier, though, because you might put some reasonable bound on the number of generations "in flight" and just allocate them all up-front, or you might implement a limited type of lock-free list that is easier to reason about because additions and deletions only occur at the head or tail.
