Sync is unreliable using std::atomic and std::condition_variable

In a distributed job system written in C++11 I have implemented a fence (i.e. a thread outside the worker thread pool may ask to block until all currently scheduled jobs are done) using the following structure:

struct fence
{
    std::atomic<size_t>                     counter;
    std::mutex                              resume_mutex;
    std::condition_variable                 resume;

    fence(size_t num_threads)
        : counter(num_threads)
    {}
};

The code implementing the fence looks like this:

void task_pool::fence_impl(void *arg)
{
    auto f = (fence *)arg;
    if (--f->counter == 0)      // (1)
        // we have zeroed this fence's counter, wake up everyone that waits
        f->resume.notify_all(); // (2)
    else
    {
        unique_lock<mutex> lock(f->resume_mutex);
        f->resume.wait(lock);   // (3)
    }
}

This works very well if threads enter the fence over a period of time. However, if they try to do it almost simultaneously, it sometimes seems to happen that between the atomic decrement (1) and the start of the wait on the condition variable (3), the thread yields CPU time and another thread decrements the counter to zero (1) and fires the condition variable (2). The earlier thread then waits forever in (3), because it starts waiting only after the notification has already been sent.

A hack that makes this workable is to put a 10 ms sleep just before (2), but that's unacceptable for obvious reasons.

Any suggestions on how to fix this in a performant way?

Your diagnosis is correct: this code is prone to losing condition notifications in the way you described. That is, after one thread has locked the mutex but before it starts waiting on the condition variable, another thread may call notify_all(), so the first thread misses that notification.

A simple fix is to lock the mutex before decrementing the counter and to keep it locked while notifying:

void task_pool::fence_impl(void *arg)
{
    auto f = static_cast<fence*>(arg);
    std::unique_lock<std::mutex> lock(f->resume_mutex);
    if (--f->counter == 0) {
        f->resume.notify_all();
    }
    else do {
        f->resume.wait(lock);
    } while(f->counter);
}

In this case the counter need not be atomic, because every access to it happens while the mutex is held.
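To try the mutex-guarded version outside the task pool, here is a minimal self-contained sketch. The names locked_fence, arrive_and_wait_locked and run_locked_fence are hypothetical stand-ins, not part of the original code, and the counter is a plain std::size_t on the assumption that it is only ever touched under resume_mutex.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the fence in the question. The counter is a
// plain size_t because every access happens while resume_mutex is held.
struct locked_fence
{
    std::size_t             counter;
    std::mutex              resume_mutex;
    std::condition_variable resume;

    explicit locked_fence(std::size_t num_threads) : counter(num_threads) {}
};

// Decrement, notify and wait all happen under the same mutex, so a
// notification cannot slip in between the decrement and the wait.
void arrive_and_wait_locked(locked_fence &f)
{
    std::unique_lock<std::mutex> lock(f.resume_mutex);
    if (--f.counter == 0) {
        f.resume.notify_all();
    } else {
        while (f.counter != 0)      // re-check guards against spurious wakeups
            f.resume.wait(lock);
    }
}

// Hypothetical harness: starts num_threads threads that all enter the fence
// and returns how many of them got past it.
std::size_t run_locked_fence(std::size_t num_threads)
{
    assert(num_threads > 0);
    locked_fence f(num_threads);
    std::size_t passed = 0;
    std::mutex passed_mutex;
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < num_threads; ++i) {
        threads.emplace_back([&f, &passed, &passed_mutex] {
            arrive_and_wait_locked(f);
            std::lock_guard<std::mutex> guard(passed_mutex);
            ++passed;
        });
    }
    for (auto &t : threads)
        t.join();
    return passed;
}
```

If a notification were lost, run_locked_fence would hang in join() instead of returning, so calling it repeatedly with many threads is a crude but useful stress test.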

An added bonus (or penalty, depending on your point of view) of locking the mutex before notifying is predictable scheduling behavior. From the POSIX documentation:

The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().

Regarding the while loop, the POSIX documentation says:

Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.

To keep the higher performance of an atomic operation instead of a full mutex, change the wait into a lock, check, and loop.

All condition waits should be done that way. std::condition_variable::wait even has an overload that takes a predicate (a function or lambda) as a second argument.

The code might look like:

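void task_pool::fence_impl(void *arg)
{
    auto f = static_cast<fence *>(arg);
    if (--f->counter == 0)      // (1)
    {
        // We have zeroed this fence's counter. Take the mutex briefly so that
        // no waiter is caught between its counter check and its wait, then
        // wake up everyone that waits.
        { std::lock_guard<std::mutex> sync(f->resume_mutex); }
        f->resume.notify_all(); // (2)
    }
    else
    {
        std::unique_lock<std::mutex> lock(f->resume_mutex);
        while (f->counter != 0) {   // re-check after every wakeup
            f->resume.wait(lock);   // (3)
        }
    }
}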
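The same loop can be written with the predicate overload of wait mentioned above. The following is a minimal sketch, not the original task pool: atomic_fence, arrive_and_wait_atomic and run_atomic_fence are hypothetical names. It also takes the mutex briefly before notify_all(). That step is an assumption added here so that a waiter cannot be caught between checking the counter and going to sleep.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the fence in the question, with an atomic counter.
struct atomic_fence
{
    std::atomic<std::size_t> counter;
    std::mutex               resume_mutex;
    std::condition_variable  resume;

    explicit atomic_fence(std::size_t num_threads) : counter(num_threads) {}
};

void arrive_and_wait_atomic(atomic_fence &f)
{
    if (--f.counter == 0) {
        // Assumption of this sketch: lock and release the mutex once so that
        // any waiter has either not checked the counter yet or is already
        // blocked inside wait() when the notification is sent.
        { std::lock_guard<std::mutex> sync(f.resume_mutex); }
        f.resume.notify_all();
    } else {
        std::unique_lock<std::mutex> lock(f.resume_mutex);
        // Equivalent to: while (f.counter != 0) f.resume.wait(lock);
        f.resume.wait(lock, [&f] { return f.counter.load() == 0; });
    }
}

// Hypothetical harness: returns how many threads got past the fence.
std::size_t run_atomic_fence(std::size_t num_threads)
{
    assert(num_threads > 0);
    atomic_fence f(num_threads);
    std::atomic<std::size_t> passed(0);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < num_threads; ++i) {
        threads.emplace_back([&f, &passed] {
            arrive_and_wait_atomic(f);
            ++passed;
        });
    }
    for (auto &t : threads)
        t.join();
    return passed.load();
}
```

With this shape, only the thread that zeroes the counter pays for a lock on the notifying side; the decrement itself stays lock-free.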
