
Design of a high-performance sorted data structure read by many threads and written by few

I have an interesting data structure design problem that is beyond my current expertise, and I'm seeking data structure or algorithm suggestions for tackling it.

The requirements:

  • Store a reasonable number of (pointer address, size) pairs (effectively two numbers; the first is useful as a sorting key) in one location
  • In a highly threaded application, many threads will look up values, to see if a specific pointer falls within one of the (address, size) pairs - that is, treating each pair as a memory range, whether the pointer lies within any range in the list. Threads will much more rarely add or remove entries from this list.
  • Reading or searching for values must be as fast as possible, happening hundreds of thousands to millions of times a second
  • Adding or removing values, ie mutating the list, happens much more rarely; performance is not as important
  • It is acceptable but not ideal for the list contents to be out of date, ie for a thread's lookup code to not find an entry that should exist, so long as at some point the entry will exist.

I am keen to avoid a naive implementation such as having a critical section to serialize access to a sorted list or tree. What data structures or algorithms might be suitable for this task?


Tagged with Delphi since I am using that language for this task. Language-agnostic answers are very welcome.

However, I probably cannot use any of the standard libraries in any language without a lot of care. The reason is that memory access (allocation, freeing, etc. of objects and their internal memory, eg tree nodes) is strictly controlled and must go through my own functions. My current code elsewhere in the same program uses red/black trees and a bit trie, and I've written these myself. Object and node allocation runs through custom memory allocation routines. It's beyond the scope of the question, but is mentioned here to avoid an answer like 'use STL structure foo.' I'm keen for an algorithmic or structural answer that, so long as I have the right references or textbooks, I can implement myself.

I would use a TDictionary<Pointer, Integer> (from Generics.Collections ) combined with a TMREWSync (from SysUtils ) for the multi-read exclusive-write access. TMREWSync allows multiple readers simultaneous access to the dictionary, as long as no writer is active. The dictionary itself provides O(1) lookup of pointers.

If you don't want to use the RTL classes the answer becomes: use a hash map combined with a multi-read exclusive-write synchronization object.

EDIT : Just realized that your pairs really represent memory ranges, so a hash map does not work. In this case you could use a sorted list (sorted by memory address) and then use binary search to quickly find a matching range. That makes the lookup O(log n) instead of O(1), though.
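As a minimal sketch of that corrected suggestion (in C++ rather than Delphi, purely for illustration; RangeTable and its members are made-up names, and std::shared_mutex stands in for TMREWSync):

#include <algorithm>
#include <cstdint>
#include <shared_mutex>
#include <vector>

struct Range { std::uintptr_t start; std::size_t size; };

class RangeTable {
    std::vector<Range> ranges_;       // kept sorted by start address
    mutable std::shared_mutex lock_;  // multi-read exclusive-write
public:
    // Reader path: shared lock, then binary search for the candidate range.
    bool contains(std::uintptr_t p) const {
        std::shared_lock<std::shared_mutex> guard(lock_);
        // Find the first range whose start is greater than p, step back one.
        auto it = std::upper_bound(ranges_.begin(), ranges_.end(), p,
            [](std::uintptr_t v, const Range& r) { return v < r.start; });
        if (it == ranges_.begin()) return false;
        --it;
        return p - it->start < it->size;   // p in [start, start + size)
    }

    // Writer path: exclusive lock, insert while keeping the array sorted.
    void add(Range r) {
        std::unique_lock<std::shared_mutex> guard(lock_);
        auto it = std::lower_bound(ranges_.begin(), ranges_.end(), r.start,
            [](const Range& x, std::uintptr_t v) { return x.start < v; });
        ranges_.insert(it, r);
    }
};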

Exploring the replication idea a bit ...

From the correctness point of view, reader/writer locks will do the job. However, in practice, while readers may be able to access the structure concurrently and in parallel, they will create huge contention on the lock, for the obvious reason that locking even for read access involves writing to the lock itself. This will kill performance on a multi-core system and even more on a multi-socket system.

The reason for the low performance is the cache-line invalidation/transfer traffic between cores/sockets. (As a side note, here's a very recent and very interesting study on the subject: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask .)

Naturally, we can avoid the inter-core cache transfers triggered by readers by making a copy of the structure on each core and restricting the reader threads to accessing only the copy local to the core they are currently executing on. This requires some mechanism for a thread to obtain its current core id. It also relies on the operating system scheduler not to move threads gratuitously across cores, ie to maintain core affinity to some extent. AFAICT, most current operating systems do.

As for the writers, their job would be to update all the existing replicas, obtaining each lock for writing. Updating one tree (apparently the structure should be some tree) at a time does mean a temporary inconsistency between replicas; from the problem description this seems acceptable. When a writer works, it will block readers on a single core, but not all readers. The drawback is that a writer has to perform the same work many times - as many times as there are cores or sockets in the system.
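A rough sketch of this scheme, with the caveats that Replicated and its helpers are invented names, sched_getcpu() is Linux-specific (Windows has GetCurrentProcessorNumber() instead), and C++17 is assumed for the over-aligned allocation:

#include <memory>
#include <sched.h>        // sched_getcpu (Linux-specific)
#include <shared_mutex>

template <typename Structure>
class Replicated {
    struct alignas(64) Replica {   // own cache line, avoids false sharing
        std::shared_mutex lock;    // per-replica reader/writer lock
        Structure data;
    };
    std::unique_ptr<Replica[]> replicas_;
    unsigned count_;
public:
    explicit Replicated(unsigned cores)
        : replicas_(new Replica[cores]), count_(cores) {}

    // Readers touch only the replica of the core they happen to run on.
    template <typename F>
    auto read(F&& f) {
        Replica& r = replicas_[static_cast<unsigned>(sched_getcpu()) % count_];
        std::shared_lock<std::shared_mutex> guard(r.lock);
        return f(r.data);
    }

    // Writers update every replica in turn, so replicas are briefly
    // inconsistent with each other - acceptable per the problem statement.
    template <typename F>
    void write(F&& f) {
        for (unsigned i = 0; i < count_; ++i) {
            std::unique_lock<std::shared_mutex> guard(replicas_[i].lock);
            f(replicas_[i].data);
        }
    }
};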

PS.

Maybe, just maybe, another alternative is some RCU -like approach, but I don't know this well, so I'll just stop after mentioning it :)

With replication you could have:

  • one copy of your data structure (list w/ binary search, the interval tree mentioned, ...) - say, the "original" one - that is used only for lookup (read access);
  • a second copy, the "update" one, created when the data is to be altered (write access), so that the write is made to the update copy.

Once writing completes, switch some "current" pointer from the "original" to the "update" version. By keeping an access counter on the "original" copy, that copy can be destroyed once its counter has decremented back to zero readers.

In pseudo-code:

// read:
data = get4Read();
... do the lookup
release4Read(data);

// write
data = get4Write();
... alter the data
release4Write(data);


// implementation:            
// current is the data structure + a 'readers' counter, initially set to '0'
get4Read() {
  lock(current_lock) {              // exclusive access to current
    current.readers++;              // one more reader
    return current;
  }
}

release4Read(copy) {
  lock(current_lock) {              // exclusive access to current
   if(0 == --copy.readers) {        // last reader
     if(copy != current) {          // it was the old, "original" one
       delete(copy);                // destroy it
     }
   }
  }
}

get4Write() {

   acquire_writelock(update_lock);  // blocks concurrent writers!

   var copy_from = get4Read(); 
   var copy_to = deep_copy(copy_from);
   copy_to.readers = 0;

   return copy_to;
}    

release4Write(data) {

   var copy_from;                    // hoisted out of the lock scope
   lock(current_lock) {              // exclusive access to current
     copy_from = current;
     current = data; 
   }

   release4Read(copy_from);

   release_writelock(update_lock);  // next write can come
}
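For comparison, here is how the same copy-swap idea could look in C++, assuming C++20's std::atomic<std::shared_ptr<T>>: the reference count plays the role of the 'readers' counter and destroys the old copy when its last reader drops it. (Note that the per-reader reference-count update still writes to a shared cache line, which is exactly the contention concern raised in the other answer.)

#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Data {                                 // e.g. the sorted range array
    std::vector<std::pair<std::uintptr_t, std::size_t>> ranges;
};

std::atomic<std::shared_ptr<const Data>> current{std::make_shared<Data>()};
std::mutex write_lock;                        // plays the role of update_lock

// Reader: grab a reference; no explicit release4Read is needed, since the
// old copy is destroyed when the last shared_ptr to it goes away.
std::shared_ptr<const Data> get4Read() {
    return current.load();
}

// Writer: deep-copy, alter, publish - readers never block.
template <typename F>
void update(F&& alter) {
    std::lock_guard<std::mutex> guard(write_lock);
    auto fresh = std::make_shared<Data>(*current.load());  // deep_copy
    alter(*fresh);                                         // ... alter the data
    current.store(std::move(fresh));                       // swap "current"
}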

To complete the answer regarding the actual data structure to use: given the fixed and quite small size of the data entries (a tuple of two integers), I would use an array for storage and binary search for the lookup. (An alternative would be the balanced tree mentioned in the comments.)

Talking about performance: as I understand it, the 'address' and 'size' define ranges, so the lookup for a given address being within such a range would involve computing 'address' + 'size' (to compare the queried address against a range's upper bound) over and over again. It may be more performant to store the start and end address explicitly, instead of the start address and size, to avoid this repeated addition.
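A sketch of that last suggestion (names are illustrative): with an exclusive end address stored per entry, the containment test is two comparisons and no arithmetic.

#include <cstdint>

struct Range {
    std::uintptr_t start;   // first address of the range
    std::uintptr_t end;     // one past the last address, ie start + size
};

// 'end' is computed once at insertion time rather than on every
// probe of the binary search.
inline bool contains(const Range& r, std::uintptr_t p) {
    return r.start <= p && p < r.end;
}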

Read the LMDB design papers at http://symas.com/mdb/. An MVCC B+tree with lockless reads and copy-on-write writes. Reads are always zero-copy, writes may optionally be zero-copy as well. Can easily handle millions of reads per second in the C implementation. I believe you should be able to use this in your Delphi program without modification, since readers also do no memory allocation. (Writers may do a few allocations, but it's possible to avoid most of them.)
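For a flavour of the reader side, here is a sketch against the LMDB C API as I understand it (error handling omitted; note an exact-match mdb_get is shown, whereas a range-containment lookup would position a cursor with MDB_SET_RANGE instead):

#include <lmdb.h>

// Look up 'ptr' in an already-opened LMDB environment 'env'.
// Sketch only: every call below can fail and should be checked.
int lookup(MDB_env *env, void *ptr) {
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, val;
    int found;

    // A read-only transaction sees a consistent MVCC snapshot of the
    // B+tree and takes no locks on the hot path.
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    key.mv_size = sizeof ptr;
    key.mv_data = &ptr;
    found = (mdb_get(txn, dbi, &key, &val) == 0);
    // On success, val.mv_data points directly into the mapped page: zero-copy.

    mdb_txn_abort(txn);   // read-only transactions are aborted, not committed
    return found;
}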

As a side note, here's a good read about memory barriers: Memory Barriers: a Hardware View for Software Hackers


This is just to answer a comment by @fast; the comment space is not big enough ...

@chill: Where do you see the need to place any 'memory barriers'?

Everywhere you access shared storage from two different cores.

For example, a writer comes, makes a copy of the data and then calls release4Write . Inside release4Write , the writer does the assignment current = data to update the shared pointer with the location of the new data, decrements the counter of the old copy to zero and proceeds to delete it. Now a reader intervenes and calls get4Read . And inside get4Read it does copy = current . Since there's no memory barrier, this happens to read the old value of current . For all we know, the write may be reordered after the delete call, or the new value of current may still reside in the writer's store queue, or the reader may not yet have seen and processed a corresponding cache invalidation request, and whatnot ... Now the reader happily proceeds to search in the copy of the data that the writer is deleting or has just deleted. Oops!
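In C++ terms, a hedged sketch of where such a barrier would sit: making the shared current pointer a std::atomic pairs the writer's publication (release) with the reader's load (acquire). This only addresses the ordering problem described above, not the reclamation race shown next.

#include <atomic>

struct Data;                  // the replicated lookup structure

std::atomic<Data*> current;   // the shared "current" pointer

// Writer: everything written into *fresh happens-before this store, so a
// reader that observes the new pointer also observes the new contents.
void publish(Data *fresh) {
    current.store(fresh, std::memory_order_release);
}

// Reader: the acquire load pairs with the writer's release store.
Data *snapshot() {
    return current.load(std::memory_order_acquire);
}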

But, wait, there's more! :D

With proper use of the get..() and release..() functions, where do you see the problem of accessing deleted data or of multiple deletion?

See the following interleaving of reader and writer operations.

Reader                      Shared data               Writer
======                      ===========               ======
                             current = A:0            

data = get4Read()
   var copy = A:0
   copy.readers++;
                             current = A:1
   return A:1
data = A:1
... do the lookup
release4Read(copy == A:1):
    --copy.readers           current = A:0
   0 == copy.readers -> true

                                                      data = get4Write():
                                                            acquire_writelock(update_lock)
                                                           var copy_from = get4Read():
                                                                  var copy = A:0
                                                                  copy.readers++; 
                             current = A:1
                                                                  return A:1
                                                           copy_from == A:1
                                                           var copy_to = deep_copy(A:1);
                                                           copy_to == B:1
                                                           return B:1
                                                      data == B:1
                                                      ... alter the data
                                                      release4Write(data = B:1)
                                                           var copy_from = current;
                                                            copy_from == A:1
                                                           current = B:1
                             current = B:1 
     A:1 != B:1 -> true
     delete A:1
                                                           !!! release4Read(A:1) !!!

And the writer accesses deleted data and then tries to delete it again. Double oops!
