Why does my lock-free message queue segfault :(?

Question

As a purely mental exercise, I'm trying to get this to work without locks or mutexes. The idea is that when the consumer thread is reading/executing messages, it atomically swaps which std::vector the producer thread uses for writes. Is this possible? I've tried playing around with thread fences, to no avail. There's a race condition here somewhere, because it occasionally segfaults. I imagine it's somewhere in the enqueue function. Any ideas?

#include <atomic>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

// should execute functions on the original thread
class message_queue {
public:
    using fn = std::function<void()>;
    using queue = std::vector<fn>;

    message_queue() : write_index(0) {
    }

    // should only be called from consumer thread
    void run () {
        // atomically gets the current pending queue and switches it with the other one
        // for example if we're writing to queues[0], we grab a reference to queue[0]
        // and tell the producer to write to queues[1]
        queue& active = queues[write_index.fetch_xor(1)];
        // skip if we don't have any messages
        if (active.size() == 0) return;
        // run all messages/callbacks
        for (auto fn : active) {
            fn();
        }
        // clear the active queue so it can be re-used
        active.clear();
        // swap active and pending queues
        write_index.fetch_xor(1);
    }
    void enqueue (fn value) {
        // loads the current pending queue and append some work
        queues[write_index.load()].push_back(value);
    }
private:
    queue queues[2];
    std::atomic<bool> is_empty; // unused for now
    std::atomic<int> write_index;


};
int main(int argc, const char * argv[])
{

    message_queue queue{};
    // flag to stop the message loop
    // doesn't actually need to be atomic because it's only read/written on the main thread
    std::atomic<bool> done(false);
    std::thread worker([&queue, &done] {
        int count = 100;
        // send 99 numbered messages (count runs from 99 down to 1)
        while (--count) {
            queue.enqueue([count] {
                // should be executed in the main thread
                std::cout << count << "\n";
            });
        }
        // finally tell the main thread we're done
        queue.enqueue([&] {
            std::cout << "done!\n";
            done = true;
        });
    });
    // run messages until the done flag is set
    while(!done) queue.run();
    worker.join();
}

Answer 1

If I understand your code correctly, there are data races, e.g.:

// producer
int r0 = write_index.load(); // r0 == 0

// consumer
int r1 = write_index.fetch_xor(1); // r1 == 0
queue& active = queues[r1];
active.size();

// producer
queues[r0].push_back(...); // r0 is still 0, the queue the consumer is now reading

Now both threads access the same queue at the same time. That's a data race, and that means undefined behaviour.
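
For comparison, here is a minimal sketch of one way to avoid that race. It is not the asker's two-vector design but a swapped-in technique: producers push nodes onto an atomic singly linked list with compare-and-swap, and the consumer detaches the whole list with a single exchange and reverses it to restore FIFO order. The names mpsc_queue, node and run_demo are hypothetical, and the sketch assumes exactly one consumer thread and callbacks that do not throw.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical replacement for message_queue: any number of producers, ONE consumer.
class mpsc_queue {
public:
    using fn = std::function<void()>;

    mpsc_queue() : head(nullptr) {}

    ~mpsc_queue() {
        // Free anything that was enqueued but never run.
        node* n = head.load(std::memory_order_acquire);
        while (n) { node* next = n->next; delete n; n = next; }
    }

    // Safe to call from any thread.
    void enqueue(fn value) {
        node* n = new node{std::move(value), head.load(std::memory_order_relaxed)};
        // Invariant: a node becomes visible to the consumer only through this
        // CAS, after it is fully constructed. On failure, n->next is refreshed
        // with the current head and we retry.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    // Consumer thread only. Returns how many messages were executed.
    std::size_t run() {
        // Invariant: after this exchange, no producer can reach the detached
        // nodes any more, so the consumer owns them exclusively.
        node* list = head.exchange(nullptr, std::memory_order_acquire);

        // The list is newest-first; reverse it to run messages oldest-first.
        node* ordered = nullptr;
        while (list) {
            node* next = list->next;
            list->next = ordered;
            ordered = list;
            list = next;
        }

        std::size_t ran = 0;
        while (ordered) {
            node* next = ordered->next;
            ordered->call();
            delete ordered;
            ordered = next;
            ++ran;
        }
        return ran;
    }

private:
    struct node {
        fn call;
        node* next;
    };
    std::atomic<node*> head;
};

// Hypothetical usage: a worker sends 0..count-1, the calling thread runs them.
std::vector<int> run_demo(int count) {
    mpsc_queue queue;
    std::vector<int> seen;  // only touched on the calling (consumer) thread
    bool done = false;      // likewise only read/written on the consumer thread
    std::thread worker([&queue, &seen, &done, count] {
        for (int i = 0; i < count; ++i)
            queue.enqueue([&seen, i] { seen.push_back(i); });
        queue.enqueue([&done] { done = true; });
    });
    while (!done) queue.run();
    worker.join();
    assert(seen.size() == static_cast<std::size_t>(count));
    return seen;
}
```

Because the consumer never removes a single node with compare-and-swap (it only takes the entire list at once), this particular shape sidesteps the usual pop-side hazards of lock-free stacks. The price is one heap allocation per message and a busy-waiting consumer, so treat it as an illustration rather than a tuned implementation.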

Answer 2

Your lock-free queue fails to work because you did not start with at least a semi-formal proof of correctness and then turn that proof into an algorithm, with the proof as the primary text and comments connecting the proof to the code.

Unless you are copy/pasting the implementation of someone else who did do that, any attempt to write a lock-free algorithm will fail. If you are copy/pasting someone else's implementation, please provide it.

Lock-free algorithms are not robust unless you have such a proof that they are correct, because the kinds of errors that make them fail are subtle, and extreme care must be taken. Simply "rolling your own" lock-free algorithm, even if it shows no apparent problems during testing, is a recipe for unreliable code.

One way to get around writing a formal proof in this kind of situation is to track down someone who has written proven-correct pseudocode or the like. Sketch out the pseudocode, together with the proof of correctness, in comments. Then fill in the code in the holes.

In general, proving that an "almost correct" lock-free algorithm is flawed is harder than writing a solid proof that a lock-free algorithm is correct if implemented in a particular way, and then implementing it. Now, if your algorithm is so flawed that it is easy to find the flaws, then you aren't showing a basic understanding of the problem domain.

In short, by posting "why is my algorithm wrong", you are approaching the writing of lock-free algorithms incorrectly. "Where is the flaw in my proof?" and "I proved this pseudocode correct here, and then I implemented it, why do my tests show deadlocks?" are good lock-free questions. "Here is a bunch of code with comments that merely describe what the next line of code does, and no comments describing why I do the next line of code, or how that line of code maintains my lock-free invariants" is not a good lock-free question.

Step back. Find some proven-correct algorithms. Learn how the proofs work. Implement some proven-correct algorithms via monkey-see, monkey-do. Look at the footnotes to note the issues their proofs overlooked (like ABA issues). After you have a bunch of those under your belt, try a variation, and do the proof, and check the proof, and do the implementation, and check the implementation.
