How to fix lock order inversion?

Question

I'm using RAII-style locks such as std::shared_lock and std::lock_guard, but I'm still hitting deadlocks.

To find out why the deadlocks happen, I ran the program under ThreadSanitizer (TSan), which reported a lock order inversion.

It printed a stack trace that went over my head. I can't seem to find what exactly is causing the lock order inversion. However, I believe it might have to do with functions that take quite a bit of time to return. I've read that holding a lock for a long time is bad, but I have to lock in order to avoid data races. I also thought it might have to do with the fact that the callbacks get invoked asynchronously.

Pseudo code

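#include <cstddef>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

// Connection, RTC::Message and TUSocket come from the application and are not shown.
std::unordered_map<size_t, Connection> connections;
std::shared_mutex connectionMapMutex;  // guards `connections`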


void LongRoutine(Connection &connection) {
    // Do work
}

void onRTCDataMessage(RTC::Message message) {
    // Shared (reader) lock: the map itself is not modified here.
    std::shared_lock guard(connectionMapMutex);

    auto it = connections.find(message.targetPeer);

    if (it == connections.end()) {
        return;
    }

    // The shared lock stays held for the whole call.
    LongRoutine(it->second);
}

void onMessage(size_t peer, std::shared_ptr<TUSocket> socket) {
    // Exclusive (writer) lock: try_emplace may insert into the map.
    std::lock_guard<std::shared_mutex> guard(connectionMapMutex);

    auto [element, inserted] = connections.try_emplace(peer);

    auto& connection = element->second;

    // Long routine call, made while the exclusive lock is held
    LongRoutine(connection);
}

void onDisconnected(size_t peer) {
    std::lock_guard<std::shared_mutex> guard(connectionMapMutex);

    connections.erase(peer);
}

TSan deadlock stack trace (uploaded to Pastebin because Stack Overflow limits the post length)

https://pastebin.com/raw/SCq2u4Aw

The stack trace I posted is from my actual application.

Answer 1

Lock order inversion occurs when one thread acquires mutex A and then tries to acquire mutex B, while another thread acquires mutex B and then tries to acquire mutex A. Each thread then waits for the other to release its mutex, so neither can proceed.
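
This scenario can be reproduced safely by using try_lock for the second acquisition, so the program reports the conflict instead of hanging. This is only an illustrative sketch: InversionResult and demonstrateInversion are names made up for this example and have nothing to do with the code in the question.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical result type: did each thread get its second mutex?
struct InversionResult {
    bool firstGotB;
    bool secondGotA;
};

InversionResult demonstrateInversion() {
    std::mutex a, b;
    // Promises are used only to force the problematic interleaving.
    std::promise<void> aHeld, bHeld, firstTried, secondTried;
    InversionResult result{true, true};

    std::thread first([&] {
        std::lock_guard<std::mutex> holdA(a);   // order: A, then B
        aHeld.set_value();
        bHeld.get_future().wait();              // wait until the other thread holds B
        result.firstGotB = b.try_lock();        // lock() would block forever here
        if (result.firstGotB) b.unlock();
        firstTried.set_value();
        secondTried.get_future().wait();        // keep A until the other thread has tried
    });

    std::thread second([&] {
        std::lock_guard<std::mutex> holdB(b);   // order: B, then A (inverted)
        bHeld.set_value();
        aHeld.get_future().wait();              // wait until the other thread holds A
        result.secondGotA = a.try_lock();       // lock() would block forever here
        if (result.secondGotA) a.unlock();
        secondTried.set_value();
        firstTried.get_future().wait();         // keep B until the other thread has tried
    });

    first.join();
    second.join();
    // With this interleaving, both threads cannot have succeeded.
    assert(!(result.firstGotB && result.secondGotA));
    return result;
}
```

Both try_lock calls fail because each mutex is still held by the other thread. With plain lock() in their place, both threads would wait forever, which is the deadlock TSan warns about.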

The solution is to create a lock hierarchy and always acquire multiple locks in the same order. If the hierarchy is ABCD, and a thread needs to acquire more than one of them at once (say A and D), always acquire them AD, never DA.
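
A minimal sketch of such a hierarchy with two mutexes, assuming the order "mutexA before mutexB". All names here (mutexA, mutexB, balanceA, balanceB, moveAToB, moveBToA, runTransfers) are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared state, each value guarded by its own mutex.
// Assumed hierarchy for this sketch: mutexA is always taken before mutexB.
std::mutex mutexA;
std::mutex mutexB;
int balanceA = 100;
int balanceB = 100;

// Moves `amount` from A to B. Locks in hierarchy order: A, then B.
void moveAToB(int amount) {
    std::lock_guard<std::mutex> lockA(mutexA);
    std::lock_guard<std::mutex> lockB(mutexB);
    balanceA -= amount;
    balanceB += amount;
}

// Moves `amount` from B to A. The data flows the other way, but the
// locks are still taken in the same order (A, then B), never B then A.
void moveBToA(int amount) {
    std::lock_guard<std::mutex> lockA(mutexA);
    std::lock_guard<std::mutex> lockB(mutexB);
    balanceB -= amount;
    balanceA += amount;
}

// Runs both directions concurrently and returns the combined balance.
int runTransfers(int iterations) {
    assert(iterations >= 0);
    std::thread t1([iterations] {
        for (int i = 0; i < iterations; ++i) moveAToB(1);
    });
    std::thread t2([iterations] {
        for (int i = 0; i < iterations; ++i) moveBToA(1);
    });
    t1.join();
    t2.join();
    // std::scoped_lock locks both mutexes with a deadlock-avoidance algorithm.
    std::scoped_lock both(mutexA, mutexB);
    return balanceA + balanceB;
}
```

Because both functions take the locks in the same order, the two threads can never end up holding one mutex each while waiting for the other. Where a fixed order is impractical, std::scoped_lock (C++17) can acquire several mutexes at once without deadlocking, as long as every site that needs them together uses it.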

Your stack trace indicates the mutexes tagged M585 and M537 violated this rule.

Answer 2

Thanks to Nate Eldredge, I was eventually able to track down the second mystery mutex using a GDB backtrace.

I was running several lambda callbacks inside LongRoutine while, at the same time, re-assigning those lambda callbacks to the same library instance.

According to the author of the library I'm using, the library requires the callback thread to return before proceeding.

So the solution was to make sure the lambda callbacks are assigned only once, not twice or more.
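
One possible way to enforce "assign only once" is std::call_once. The sketch below is an assumption-heavy illustration, not the actual fix from my application: Channel, setOnMessage, callbacksAssigned and assignmentsAfter are hypothetical names, and the real library's API is not shown here.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for the library object. The real library's API
// is not shown in the question, so every name here is an assumption.
struct Channel {
    std::function<void(int)> onMessage;
    int assignments = 0;  // counts how often the callback was (re)assigned

    void setOnMessage(std::function<void(int)> cb) {
        onMessage = std::move(cb);
        ++assignments;
    }
};

// Hypothetical connection type holding one library object.
struct Connection {
    Channel channel;
    std::once_flag callbacksAssigned;  // guards the one-time registration
    int received = 0;
};

// May be called many times (for example, once per incoming message),
// but registers the callback only on the first call.
void LongRoutine(Connection& connection) {
    std::call_once(connection.callbacksAssigned, [&connection] {
        connection.channel.setOnMessage([&connection](int value) {
            connection.received += value;
        });
    });
    // ... rest of the long-running work ...
}

// Calls LongRoutine `calls` times and reports how often the callback was assigned.
int assignmentsAfter(int calls) {
    Connection connection;
    for (int i = 0; i < calls; ++i) {
        LongRoutine(connection);
    }
    connection.channel.onMessage(5);  // the callback registered once still works
    assert(connection.received == 5);
    return connection.channel.assignments;
}
```

However often LongRoutine runs, the callback is never re-assigned while an earlier callback may still be executing.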

2 answers

Answer 1: 0, 2021-10-24 15:45:46

Answer 2: 0, ACCEPTED, 2021-10-27 22:51:06
