
Thousands of reader/writer locks in a single process

I am currently designing a C++ cross-platform (Linux/Windows) server application with a large-scale synchronization pattern. I internally use boost::thread as an abstraction of the OS-specific threads. My problem is to protect an array of data, with each element of the array protected by an independent reader/writer lock.

My array contains 4096 elements. Considering the solution to the "writer-priority readers-writers" problem presented in the "Little Book of Semaphores" (page 85), my application would need 5 semaphores per array element. This gives a total of about 20000 semaphores (or, equivalently, 20000 mutexes + 20000 condition variables).

An additional specificity of my application is that, at a given time, most semaphores are not active (there are typically about 32 "client" threads waiting/signaling on the thousands of semaphores). Note that since the entire server runs in a single process, I use lightweight, thread-based semaphores (not interprocess semaphores).
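For concreteness, the per-element locking layout described above might look like the following sketch (my illustration, not the original code, with boost::shared_mutex standing in for the five-semaphore writer-priority construction; Slot, read_element and write_element are made-up names):

    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>
    #include <array>
    #include <cstddef>

    // One reader/writer lock per array slot: 4096 locks in total.
    struct Slot {
        boost::shared_mutex lock;
        int payload = 0;               // placeholder for the real element data
    };

    static std::array<Slot, 4096> table;

    int read_element(std::size_t i) {
        boost::shared_lock<boost::shared_mutex> guard(table[i].lock);   // reader side
        return table[i].payload;
    }

    void write_element(std::size_t i, int value) {
        boost::unique_lock<boost::shared_mutex> guard(table[i].lock);   // writer side
        table[i].payload = value;
    }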

My question is twofold:

  1. Is it recommended to create a total of 20000 semaphores on Linux and on Windows for a single process? Well, of course, I guess this is not the case...

  2. If this practice is not recommended, what technique could I use to reduce the number of actual semaphores, e.g. to create a set of N "emulated semaphores" on top of 1 actual semaphore? I suppose that this would be an interesting solution, because most of my semaphores are inactive at a given time.

Thanks in advance!

Summary of the answers so far

  1. Using thousands of semaphores is not recommended, especially from a cross-platform perspective, even if they are not interprocess semaphores (they still consume handles under Windows).
  2. A direct way to solve my problem is to split my array into e.g. 256 subarrays of 16 elements, and to associate each of these subarrays with a single read/write lock. Unfortunately, this introduces a lot of contention (1 writer would block reads to the other 15 elements of its subarray).
  3. Digging into Boost source code, I have found that:

    • The implementation of "boost::mutex" does not wrap a CRITICAL_SECTION under Windows (it is built on CreateEvent and ReadWriteBarrier instead),
    • "boost::shared_mutex" uses CreateSemaphore under Windows (which are heavyweight, interprocess objects) , and
    • "boost::shared_mutex" does not wrap "pthread_rwlock_t" under Linux.

    The reasons for this do not seem clear to me. In particular, the use of interprocess objects for "boost::shared_mutex" under Windows seems sub-optimal to me.

Summary of the open questions so far

  1. How can I create a set of N "emulated semaphores" on top of 1 actual semaphore, keeping the contention between the emulated semaphores as small as possible?
  2. How do "boost::mutex" and "boost::shared_mutex" compare with their native counterparts (CRITICAL_SECTION and pthread_rwlock_t)?

  1. This is not recommended. You should not do this, because on Windows each semaphore consumes one Handle object, and a process can only manage a limited number of Handle objects. Threads, processes and other Windows objects also need Handle objects and will fail if they cannot obtain one. The situation is similar on Linux with the file-descriptor concept.

  2. Split your 4096 elements into, for example, 30 sets of about 140 elements each, and assign a single semaphore to each group. The 30 (in this example) threads that access those 30 sets will then be synchronized through their group's semaphore. A sketch of this striping idea follows.
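A minimal sketch of that striping idea (my illustration, not the answerer's code), using one boost::shared_mutex per group as a stand-in for the group semaphore; the stripe count of 64 and the stripe_for helper are arbitrary choices for the example:

    #include <boost/thread/shared_mutex.hpp>
    #include <boost/thread/locks.hpp>
    #include <array>
    #include <cstddef>

    // Lock striping: 4096 logical locks folded onto a fixed pool of 64
    // boost::shared_mutex objects. Element i is protected by stripe i % 64,
    // so a writer also blocks readers of the other elements that happen to
    // share its stripe (the contention trade-off noted above).
    constexpr std::size_t kStripes = 64;
    static std::array<boost::shared_mutex, kStripes> stripes;

    boost::shared_mutex& stripe_for(std::size_t element_index) {
        return stripes[element_index % kStripes];
    }

    void read_access(std::size_t i) {
        boost::shared_lock<boost::shared_mutex> guard(stripe_for(i));
        // ... read element i ...
    }

    void write_access(std::size_t i) {
        boost::unique_lock<boost::shared_mutex> guard(stripe_for(i));
        // ... modify element i ...
    }

Every element index maps to exactly one stripe, so correctness is preserved; the cost is false contention between elements that share a stripe, as noted in the answer summary above.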

I'll tell you what I think about it from the Windows perspective. I'm very experienced in writing server applications for Windows.

First, there's absolutely no problem with creating 20k semaphores for a single process. A semaphore is a pretty lightweight kernel object, even an "inter-process" one.

I see, however, another issue with your design. You should know that every operation you do on a kernel object (such as a semaphore or mutex) involves a heavy kernel-mode transition (i.e., a system call). Every such call may cost you around 2k CPU cycles, even if there are no collisions at all.

You may therefore find yourself in a situation where most of the processor time is spent just invoking the synchronization methods.

By contrast, threads can be synchronized with interlocked operations. They cost much less (typically tens of CPU cycles).
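For example, a shared counter can be updated with a single interlocked instruction that never leaves user mode (a tiny illustration, not from the original answer):

    #include <windows.h>

    // An interlocked update stays entirely in user mode: no system call,
    // typically tens of cycles instead of ~2k for a kernel transition.
    volatile LONG g_counter = 0;

    void bump_counter()
    {
        InterlockedIncrement(&g_counter);   // atomic read-modify-write, no kernel object involved
    }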

There's also an object called a critical section. It's a sort of hybrid of interlocked operations and a kernel object (the latter is used only if there's an actual collision). You should check how long you typically lock your elements. If the locks are usually short-lived, just use critical sections and forget about sophisticated read-write locks.
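A minimal critical-section usage sketch for such a short-duration lock (illustrative only; g_cs and update_value are made-up names):

    #include <windows.h>

    // A CRITICAL_SECTION guarding a short critical section: the uncontended
    // Enter/Leave pair is handled with interlocked operations in user mode;
    // the kernel is involved only when there is an actual collision.
    CRITICAL_SECTION g_cs;
    int g_shared_value = 0;

    void init_lock()    { InitializeCriticalSection(&g_cs); }
    void destroy_lock() { DeleteCriticalSection(&g_cs); }

    void update_value(int v)
    {
        EnterCriticalSection(&g_cs);
        g_shared_value = v;             // keep the protected region short
        LeaveCriticalSection(&g_cs);
    }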

If, nevertheless, you deal with long-duration locks, you really do need a read-write lock, and you see that you spend a lot of CPU on kernel-mode transitions, consider creating your own (or finding an existing) hybrid implementation of such a lock.
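One existing hybrid of this kind on Windows is the slim reader/writer lock (SRWLOCK, available since Windows Vista): it is a single pointer-sized word, acquired with interlocked operations on the uncontended path, so keeping one per element is cheap. A sketch (my addition, not part of the original answer):

    #include <windows.h>

    // SRWLOCK is a single pointer-sized word; SRWLOCK_INIT is all-zero, so a
    // static array is already correctly initialized. The fast path uses
    // interlocked operations; a kernel wait happens only under contention.
    static SRWLOCK g_locks[4096];

    void read_item(int i)
    {
        AcquireSRWLockShared(&g_locks[i]);
        // ... read element i ...
        ReleaseSRWLockShared(&g_locks[i]);
    }

    void write_item(int i)
    {
        AcquireSRWLockExclusive(&g_locks[i]);
        // ... modify element i ...
        ReleaseSRWLockExclusive(&g_locks[i]);
    }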

On Linux you should definitely not implement the locks yourself but use pthread_rwlock_t.

Having an array of 4096 such elements shouldn't present any particular problem. POSIX lock structures are implemented quite efficiently in Linux. In particular, they use atomic operations when possible on the "fast path" and only go into system calls (notably futex waits) when there is contention on that particular lock. So if you implement things relatively carefully, such that any thread only holds 1 or 2 locks at a time, the constraint on Linux will only be given by your total number of worker threads and not by the number of objects themselves.
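For example, a plain static array of pthread_rwlock_t objects covers the 4096 elements directly (a sketch with made-up helper names):

    #include <pthread.h>

    // 4096 read/write locks: each pthread_rwlock_t is a few dozen bytes, the
    // uncontended path is atomic operations in user space, and futex system
    // calls are made only when the particular lock is contended.
    constexpr int kNumElements = 4096;
    static pthread_rwlock_t g_rwlocks[kNumElements];

    void init_locks()
    {
        for (int i = 0; i < kNumElements; ++i)
            pthread_rwlock_init(&g_rwlocks[i], nullptr);   // default (process-private) attributes
    }

    void read_slot(int i)
    {
        pthread_rwlock_rdlock(&g_rwlocks[i]);
        // ... read element i ...
        pthread_rwlock_unlock(&g_rwlocks[i]);
    }

    void write_slot(int i)
    {
        pthread_rwlock_wrlock(&g_rwlocks[i]);
        // ... modify element i ...
        pthread_rwlock_unlock(&g_rwlocks[i]);
    }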
