
A readers/writer lock… without having a lock for the readers?

I get the feeling this may be a very general and common situation for which a well-known no-lock solution exists.

In a nutshell, I'm hoping there's an approach like a readers/writer lock, but one that doesn't require the readers to acquire a lock and thus can have better average performance.

Instead there'd be some atomic operations (128-bit CAS) for a reader, and a mutex for a writer. I'd have two copies of the data structure: a read-only one for the normally-successful queries, and an identical copy to be updated under mutex protection. Once the data has been inserted into the writable copy, we make it the new readable copy. The old readable copy then gets the same insertion in turn, once all the pending readers have finished reading it: the writer spins on the number of readers left until it's zero, applies the same modification, and finally releases the mutex.

Or something like that.

Anything along these lines exist?

If your data fits in a 64-bit value, most systems can cheaply read/write that atomically, so just use std::atomic<my_struct>.
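A minimal sketch of that case (the two-by-32-bit layout and field names are just an example; any trivially-copyable type of the right size works):

#include <atomic>
#include <cstdint>

// Hypothetical 64-bit payload: two 32-bit fields, trivially copyable.
struct my_struct { int32_t a; int32_t b; };

std::atomic<my_struct> g_val;   // usually lock-free on 64-bit targets; check is_lock_free()

void write_val(int32_t a, int32_t b) {
    g_val.store({a, b}, std::memory_order_release);   // one atomic 64-bit store, no tearing
}

my_struct read_val() {
    return g_val.load(std::memory_order_acquire);     // one atomic 64-bit load
}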

For smallish and/or infrequently-written data, there are a couple of ways to make readers truly read-only on the shared data, not having to do any atomic RMW operations on a shared counter or anything. This allows read-side scaling to many threads without readers contending with each other (unlike a 128-bit atomic read on x86 using lock cmpxchg16b, or taking a RWlock).

Ideally just an extra level of indirection via an atomic<T*> pointer (RCU), or just an extra load + compare-and-branch (SeqLock); no atomic RMWs or memory barriers stronger than acq/rel or anything else in the read side.

This can be appropriate for data that's read very frequently by many threads, eg a timestamp updated by a timer interrupt but read all over the place. Or a config setting that typically never changes.

If your data is larger and/or changes more frequently, one of the strategies suggested in other answers, which requires a reader to still take a RWlock on something or atomically increment a counter, will be more appropriate. This won't scale perfectly, because each reader still needs to get exclusive ownership of the shared cache line containing the lock or counter in order to modify it, but there's no such thing as a free lunch.

RCU

It sounds like you're half-way to inventing RCU (Read Copy Update) where you update a pointer to the new version.

But remember a lock-free reader might stall after loading the pointer, so you have a deallocation problem. This is the hard part of RCU. In a kernel it can be solved by having sync points where you know that there are no readers older than some time t, and thus can free old versions. There are some user-space implementations. https://en.wikipedia.org/wiki/Read-copy-update and https://lwn.net/Articles/262464/ .

For RCU, the less frequent the changes, the larger a data structure you can justify copying. eg even a moderate-sized tree could be doable if it's only ever changed interactively by an admin, while readers are running on dozens of cores all checking something in parallel. eg kernel config settings are one thing where RCU is great in Linux.
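As a rough illustration of the read and publish halves of that pattern (the Config type and field names here are just placeholders, and safe reclamation is deliberately punted on, since that's the hard part mentioned above):

#include <atomic>
#include <mutex>
#include <string>

// Made-up config type for illustration.
struct Config { int timeout_ms; std::string log_path; };

std::atomic<const Config*> g_config{new Config{1000, "/tmp/log"}};
std::mutex g_writer_mutex;   // writers still serialize among themselves

// Read side: one acquire load, no RMW, nothing stronger than acquire.
int current_timeout() {
    const Config* c = g_config.load(std::memory_order_acquire);
    return c->timeout_ms;
}

// Write side: copy, modify the copy, publish the new pointer.
void set_timeout(int ms) {
    std::lock_guard<std::mutex> lk(g_writer_mutex);
    const Config* old = g_config.load(std::memory_order_relaxed);
    Config* fresh = new Config(*old);
    fresh->timeout_ms = ms;
    g_config.store(fresh, std::memory_order_release);
    // NOTE: 'old' is deliberately leaked here.  Freeing it safely is the hard part
    // of RCU: you need a grace period (no reader can still hold 'old') before
    // deleting, e.g. via liburcu, epochs, or hazard pointers.
}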


SeqLock

If your data is small (eg a 64-bit timestamp on a 32-bit machine), another good option is a SeqLock. Readers check a sequence counter before/after a non-atomic copy of the data into a private buffer. If the sequence counters match, we know there wasn't tearing. (Writers mutually exclude each other with a separate mutex.) See Implementing 64 bit atomic counter with 32 bit atomics / how to implement a seqlock lock using c++11 atomic library.

It's a bit of a hack in C++ to write something that can compile efficiently to a non-atomic copy that might have tearing, because inevitably that's data-race UB. (Unless you use std::atomic<long> with mo_relaxed for each chunk separately, but then you're preventing the compiler from using movdqu or something to copy 16 bytes at once.)

A SeqLock makes the reader copy the whole thing (or ideally just load it into registers) every read so it's only ever appropriate for a small struct or 128-bit integer or something. But for less than 64 bytes of data it can be quite good, better than having readers use lock cmpxchg16b for a 128-bit datum if you have many readers and infrequent writes.

It's not lock-free, though: a writer that sleeps while modifying the SeqLock could get readers stuck retrying indefinitely. For a small SeqLock the window is small, and obviously you want to have all the data ready before you do the first sequence-counter update to minimize the chance for an interrupt pausing the writer in mid update.

The best case is when there's only 1 writer so it doesn't have to do any locking; it knows nothing else will be modifying the sequence counter.
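A rough single-writer sketch along those lines, using per-word relaxed atomics for the payload to sidestep the data-race-UB issue mentioned above (at some cost in copy efficiency); the payload size and names are arbitrary:

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWords = 4;                    // e.g. a 32-byte payload

std::atomic<uint32_t> g_seq{0};                      // even = stable, odd = write in progress
std::array<std::atomic<uint64_t>, kWords> g_data{};  // per-word relaxed access avoids formal UB

// Single writer, so no writer-side mutex is needed; multiple writers would take one.
void seq_write(const std::array<uint64_t, kWords>& src) {
    uint32_t s = g_seq.load(std::memory_order_relaxed);
    g_seq.store(s + 1, std::memory_order_relaxed);        // go odd: a write is in progress
    std::atomic_thread_fence(std::memory_order_release);  // odd counter ordered before new data
    for (std::size_t i = 0; i < kWords; ++i)
        g_data[i].store(src[i], std::memory_order_relaxed);
    g_seq.store(s + 2, std::memory_order_release);        // back to even: data is stable
}

std::array<uint64_t, kWords> seq_read() {
    std::array<uint64_t, kWords> out;
    uint32_t s0, s1;
    do {
        s0 = g_seq.load(std::memory_order_acquire);
        for (std::size_t i = 0; i < kWords; ++i)
            out[i] = g_data[i].load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);   // keep the copy before the re-check
        s1 = g_seq.load(std::memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);                    // retry on in-progress write or tearing
    return out;
}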

What you're describing is very similar to double instance locking and left-right concurrency control .

In terms of progress guarantees, the difference between the two is that the former is lock-free for readers while the latter is wait-free. Both are blocking for writers.
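For reference, here's a condensed sketch of the left-right pattern from that second link, with a plain pair of counters as the read-indicator (real implementations use more scalable read-indicators; the names and the std::map payload are just for illustration):

#include <atomic>
#include <map>
#include <mutex>

template <typename K, typename V>
class LeftRightMap {
    std::map<K, V> inst_[2];                       // the two instances of the structure
    std::atomic<int> leftRight_{0};                // which instance readers should use
    std::atomic<int> versionIndex_{0};             // which read-indicator readers should use
    std::atomic<int> readers_[2] = { {0}, {0} };   // per-version reader counts
    std::mutex writersMutex_;

    void waitForReaders(int vi) {
        while (readers_[vi].load(std::memory_order_acquire) != 0) { /* spin */ }
    }

public:
    // Wait-free read: arrive, read, depart.  No CAS, no retry loop.
    bool find(const K& key, V& out) {
        int vi = versionIndex_.load(std::memory_order_acquire);
        readers_[vi].fetch_add(1, std::memory_order_seq_cst);      // arrive
        int lr = leftRight_.load(std::memory_order_acquire);
        auto it = inst_[lr].find(key);
        bool found = (it != inst_[lr].end());
        if (found) out = it->second;
        readers_[vi].fetch_sub(1, std::memory_order_release);      // depart
        return found;
    }

    // Blocking write: apply the same mutation to both instances in turn.
    void insert(const K& key, const V& value) {
        std::lock_guard<std::mutex> lk(writersMutex_);
        int lr = leftRight_.load(std::memory_order_relaxed);
        inst_[1 - lr][key] = value;                           // no readers are in this instance
        leftRight_.store(1 - lr, std::memory_order_seq_cst);  // new readers use the updated copy

        int vi = versionIndex_.load(std::memory_order_relaxed);
        waitForReaders(1 - vi);                               // drain the *other* indicator first
        versionIndex_.store(1 - vi, std::memory_order_seq_cst);
        waitForReaders(vi);                                   // then readers still on the old copy

        inst_[lr][key] = value;                               // now safe: no readers left in it
    }
};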

It turns out the two-structure solution I was thinking of has similarities to http://concurrencyfreaks.blogspot.com/2013/12/left-right-concurrency-control.html

Here's the specific data structure and pseudocode I had in mind.

We have two copies of some arbitrary data structure called MyMap allocated, and two pointers out of a group of three pointers point to these two. Initially, one is pointed to by achReadOnly[0].pmap and the other by pmapMutable.

A quick note on achReadOnly: it has a normal state and two temporary states. The normal state will be (WLOG for cell 0/1):

achReadOnly = { { pointer to one data structure, number of current readers },
                { nullptr, 0 } }
pmapMutable = pointer to the other data structure

When we've finished mutating "the other," we store it in the unused slot of the array as it is the next-generation read-only and it's fine for readers to start accessing it.

achReadOnly = { { pointer to one data structure, number of old readers },
                { pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the other data structure

The writer then clears the pointer to "the one", the previous-generation readonly, forcing readers to go to the next-generation one. We move that to pmapMutable.

achReadOnly = { { nullptr, number of old readers },
                { pointer to the other data structure, number of new readers } }
pmapMutable = pointer to the one data structure

The writer then spins until the number of old readers hits one (itself), at which point that copy can receive the same update. That 1 is overwritten with 0 to clean up in preparation for moving forward. Though in fact it could be left dirty, as it won't be referred to before being overwritten.

struct CountedHandle {
    MyMap*   pmap;
    int      iReaders;
};

// Data Structure:
atomic<CountedHandle> achReadOnly[2];
MyMap* pmapMutable;
mutex_t muxMutable;

data Read( key ) {
    int iWhich = 0;
    CountedHandle chNow, chUpdate;

    // Spin if necessary to update the reader counter on a pmap, and/or
    // to find a pmap (as the pointer will be overwritten with nullptr once
    // a writer has finished updating the mutable copy and made it the next-
    // generation read-only in the other slot of achReadOnly[].

    do {
        chNow = achReadOnly[ iWhich ];
        if ( !chNow.pmap ) {
            iWhich = 1 - iWhich;
            continue;
        }
        chUpdate = chNow;
        chUpdate.iReaders++;   // register ourselves as a reader on the copy we just read
    } while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );

    // Now we've found a map, AND registered ourselves as a reader of it atomically.
    // Importantly, it is impossible any reader has this pointer but isn't
    // represented in that count.

    if ( data = chNow.pmap->Find( key ) ) {
        // Deregister ourselves as a reader.
        do {
            chNow = achReadOnly[ iWhich ];
            chUpdate = chNow;
            chUpdate.iReaders--;   // deregister: decrement the count on the same slot
        } while ( CAS( achReadOnly[ iWhich ], chNow, chUpdate ) fails );

        return data;
    }

    // OK, we have to add it to the structure.

    lock muxMutable;
    figure out data for this key
    pmapMutable->Add( key, data );

    // It's now the next-generation read-only.  Put it where readers can find it.
    achReadOnly[ 1 - iWhich ].pmap = pmapMutable;

    // Prev-generation readonly is our Mutable now, though we can't change it
    // until the readers are gone.
    pmapMutable = achReadOnly[ iWhich ].pmap;

    // Force readers to look for the next-generation readonly.
    achReadOnly[ iWhich ].pmap = nullptr;

    // Spin until all readers finish with previous-generation readonly.
    // Remember we added ourselves as reader so wait for 1, not 0.

    while ( achReadOnly[ iWhich ].iReaders > 1 )
        ;

    // Remove our reader count.
    achReadOnly[ iWhich ].iReaders = 0;

    // No more readers for previous-generation readonly, so we can now write to it.
    pmapMutable->Add( key, data );

    unlock muxMutable;

    return data;

}
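For what it's worth, the reader-registration loop above maps fairly directly onto std::atomic with a 16-byte CountedHandle; whether compare_exchange actually compiles to lock cmpxchg16b rather than a library fallback depends on your compiler and flags (e.g. -mcx16 with GCC/Clang), so it's worth checking is_lock_free(). A sketch of just the registration step, with the surrounding names assumed as above:

#include <atomic>
#include <cstdint>

struct MyMap;                 // whatever read-mostly structure you're protecting

struct CountedHandle {
    MyMap*  pmap;
    int64_t iReaders;         // pointer + count = 16 bytes, so the pair is one CAS
};

std::atomic<CountedHandle> achReadOnly[2];

// Register as a reader; returns which slot we registered on and the handle we installed.
int RegisterReader(CountedHandle& chOut) {
    int iWhich = 0;
    for (;;) {
        CountedHandle chNow = achReadOnly[iWhich].load(std::memory_order_acquire);
        if (!chNow.pmap) {                // a writer retired this slot; try the other one
            iWhich = 1 - iWhich;
            continue;
        }
        CountedHandle chUpdate = chNow;
        chUpdate.iReaders++;
        // 16-byte CAS: succeeds only if no writer or other reader touched the slot meanwhile.
        if (achReadOnly[iWhich].compare_exchange_weak(chNow, chUpdate,
                                                      std::memory_order_acq_rel,
                                                      std::memory_order_acquire)) {
            chOut = chUpdate;
            return iWhich;
        }
    }
}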

Solution that has come to me:

Every thread has a thread_local copy of the data structure, and this can be queried at will without locks. Any time you find your data, great, you're done.

If you do NOT find your data, then you acquire a mutex for the master copy.

This will potentially have many new insertions in it from other threads (possibly including the data you need). Check to see if it has your data, and if not, insert it.

Finally, copy all the recent updates, including the entry for the data you need, to your own thread_local copy. Release the mutex and you're done.

Readers can read all day long, in parallel, even when updates are happening, without locks. A lock is only needed when writing (or sometimes when catching up). This general approach would work for a wide range of underlying data structures. QED


Having many thread_local indexes sounds memory-inefficient if you have lots of threads using this structure.

However, the data found by the index, if it's read-only, need only have one copy, referred to by many indices. (Luckily, that is my case.)

Also, many threads might not be randomly accessing the full range of entries; maybe some only need a few entries and will very quickly reach a final state where their local copy of the structure can find all the data needed, before it grows much. And yet many other threads may not refer to this at all. (Luckily, that is my case.)

Finally, to "copy all the recent updates" it'd help if all new data added to the structure were, say, pushed onto the end of a vector so given that say you have 4000 entries in your local copy, the master copy has 4020, you can with a few machine cycles locate the 20 objects you need to add. (Luckily, that is my case.)
