
Design of a high-performance sorted data structure read by many threads and written by few

I have an interesting data structure design problem that is beyond my current expertise. I'm seeking data structure or algorithm answers about tackling this problem.

The requirements:

  • Store a reasonable number of (pointer address, size) pairs (effectively two numbers; the first is useful as a sorting key) in one location
  • In a highly threaded application, many threads will look up values, to see if a specific pointer falls within one of the (address, size) pairs; that is, treating each pair as a memory range, whether the pointer lies within any range in the list. Threads will much more rarely add or remove entries from this list.
  • Reading or searching for values must be as fast as possible, happening hundreds of thousands to millions of times a second
  • Adding or removing values, i.e. mutating the list, happens much more rarely; performance there is not as important
  • It is acceptable, though not ideal, for the list contents to be out of date, i.e. for a thread's lookup code not to find an entry that should exist, so long as the entry will exist at some point.

I am keen to avoid a naive implementation such as having a critical section to serialize access to a sorted list or tree. What data structures or algorithms might be suitable for this task?


Tagged with Delphi since I am using that language for this task. Language-agnostic answers are very welcome.

However, I probably cannot use any of the standard libraries in any language without a lot of care. The reason is that memory access (allocation, freeing, etc. of objects and their internal memory, e.g. tree nodes) is strictly controlled and must go through my own functions. My current code elsewhere in the same program uses red/black trees and a bit trie, and I've written these myself. Object and node allocation runs through custom memory allocation routines. It's beyond the scope of the question, but is mentioned here to avoid an answer like 'use STL structure foo.' I'm keen for an algorithmic or structure answer that, so long as I have the right references or textbooks, I can implement myself.

I would use a TDictionary<Pointer, Integer> (from Generics.Collections) combined with a TMREWSync (from SysUtils) for the multi-read exclusive-write access. TMREWSync allows multiple readers simultaneous access to the dictionary, as long as no writer is active. The dictionary itself provides O(1) lookup of pointers.

If you don't want to use the RTL classes the answer becomes: use a hash map combined with a multi-read exclusive-write synchronization object.

EDIT: Just realized that your pairs really represent memory ranges, so a hash map does not work. In this case you could use a sorted list (sorted by memory address) and then use binary search to quickly find a matching range. That makes the lookup O(log n) instead of O(1), though.

Exploring the replication idea a bit ...

From the correctness point of view, reader/writer locks will do the work. However, in practice, while readers may be able to proceed concurrently and in parallel with accessing the structure, they will create huge contention on the lock, for the obvious reason that locking even for read access involves writing to the lock itself. This will kill performance on a multi-core system, and even more so on a multi-socket system.

The reason for the low performance is the cache-line invalidation/transfer traffic between cores/sockets. (As a side note, here's a very recent and very interesting study on the subject: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.)

Naturally, we can avoid the inter-core cache transfers triggered by readers by making a copy of the structure on each core and restricting reader threads to accessing only the copy local to the core they are currently executing on. This requires some mechanism for a thread to obtain its current core id. It also relies on the operating system scheduler not to move threads gratuitously across cores, i.e. to maintain core affinity to some extent. AFAICT, most current operating systems do.
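On Linux, the "obtain the current core id" mechanism could be sched_getcpu() (a glibc/Linux extension, not portable POSIX). A minimal sketch of routing a reader to its per-core replica might look like this; the Replica type and the fixed replica count are illustrative placeholders:

```c
#define _GNU_SOURCE           /* required for sched_getcpu() on glibc */
#include <sched.h>
#include <pthread.h>

#define MAX_CORES 64

typedef struct {
    pthread_rwlock_t lock;    /* per-replica reader/writer lock */
    /* ... the sorted structure itself would live here ... */
} Replica;

static Replica replicas[MAX_CORES];

/* A reader locks only the replica local to the core it runs on, so the
 * lock's cache line tends to stay in that core's cache.  If the thread
 * migrates mid-lookup it still works, just with a remote cache miss. */
Replica *local_replica(void) {
    int cpu = sched_getcpu();             /* -1 on failure */
    if (cpu < 0) cpu = 0;                 /* fall back to replica 0 */
    return &replicas[cpu % MAX_CORES];    /* clamp if the box has more cores */
}
```

Note the affinity is best-effort by design, matching the answer's point: correctness never depends on the thread staying on one core, only performance does.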

As for the writers, their job would be to update all the existing replicas, obtaining each lock for writing in turn. Updating one tree (apparently the structure should be some tree) at a time does mean a temporary inconsistency between replicas. From the problem description this seems acceptable. When a writer works, it will block readers on a single core, but not all readers. The drawback is that a writer has to perform the same work many times, as many times as there are cores or sockets in the system.

PS.

Maybe, just maybe, another alternative is some RCU-like approach, but I don't know it well, so I'll just stop after mentioning it :)

With replication you could have:

  • one copy of your data structure (list with binary search, the interval tree mentioned, ...), say the "original" one, that is used only for the lookup (read access);
  • a second copy, the "update" one, created when the data is to be altered (write access). The write is made to the update copy.

Once writing completes, change some "current" pointer from the "original" to the "update" version. With an access counter on the "original" copy, it can be destroyed once the counter drops back to zero readers.

In pseudo-code:

// read:
data = get4Read();
... do the lookup
release4Read(data);

// write
data = get4Write();
... alter the data
release4Write(data);


// implementation:            
// current is the data structure + a 'readers' counter, initially set to '0'
get4Read() {
  lock(current_lock) {              // exclusive access to current
    current.readers++;              // one more reader
    return current;
  }
}

release4Read(copy) {
  lock(current_lock) {              // exclusive access to current
   if(0 == --copy.readers) {        // last reader
     if(copy != current) {          // it was the old, "original" one
       delete(copy);                // destroy it
     }
   }
  }
}

get4Write() {

   acquire_writelock(update_lock);  // blocks concurrent writers!

   var copy_from = get4Read(); 
   var copy_to = deep_copy(copy_from);
   copy_to.readers = 0;

   return copy_to;
}    

release4Write(data) {

   lock(current_lock) {              // exclusive access to current
     var copy_from = current;
     current = data; 
   }

   release4Read(copy_from);

   release_writelock(update_lock);  // next write can come
}

To complete the answer regarding the actual data structure to use: given the fixed size of the data entries (a two-integer tuple), which is also quite small, I would use an array for storage and binary search for the lookup. (An alternative would be a balanced tree, mentioned in the comments.)

Talking about performance: as I understand it, the 'address' and 'size' define ranges. Thus, the lookup for a given address being within such a range would involve computing 'address' + 'size' (to compare the queried address with the range's upper bound) over and over again. It may be more performant to store start and end addresses explicitly, instead of start address and size, to avoid this repeated addition.
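Concretely, the suggestion is to precompute the exclusive end address once at insertion time, so the hot-path comparison has no addition in it. A tiny illustrative sketch (names are made up for this example):

```c
#include <stdbool.h>
#include <stdint.h>

/* Store the exclusive end address instead of the size.
 * end = start + size is computed once, when the entry is added. */
typedef struct { uintptr_t start; uintptr_t end; } RangeEx;

/* Per-probe test during binary search: two compares, no addition. */
static inline bool in_range(const RangeEx *r, uintptr_t p) {
    return p >= r->start && p < r->end;
}
```

Whether this matters in practice depends on the compiler and CPU (an add is cheap), but it costs nothing and keeps the comparator trivial.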

Read the LMDB design papers at http://symas.com/mdb/. An MVCC B+tree with lockless reads and copy-on-write writes. Reads are always zero-copy; writes may optionally be zero-copy as well. It can easily handle millions of reads per second in the C implementation. I believe you should be able to use this in your Delphi program without modification, since readers also do no memory allocation. (Writers may do a few allocations, but it's possible to avoid most of them.)

As a side note, here's a good read about memory barriers: Memory Barriers: a Hardware View for Software Hackers


This is just to answer a comment by @fast; the comment space is not big enough ...

@chill: Where do you see the need to place any 'memory barriers'?

Everywhere you access shared storage from two different cores.

For example, a writer comes, makes a copy of the data and then calls release4Write. Inside release4Write, the writer does the assignment current = data to update the shared pointer with the location of the new data, decrements the counter of the old copy to zero and proceeds with deleting it. Now a reader intervenes and calls get4Read. And inside get4Read it does copy = current. Since there's no memory barrier, this happens to read the old value of current. For all we know, the write may be reordered after the delete call, or the new value of current may still reside in the writer's store queue, or the reader may not yet have seen and processed a corresponding cache invalidation request, and whatnot ... Now the reader happily proceeds to search in that copy of the data that the writer is deleting or has just deleted. Oops!
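The publish/observe ordering this comment demands can be expressed with C11 atomics: a release store when the writer swaps current, paired with an acquire load in the reader. This sketch shows only that pairing on a toy payload (the names are illustrative); it does not by itself fix the reclamation race with the readers counter, which still needs the locking or counting discussed here:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct { int payload; } Data;

static Data a = { 1 }, b = { 2 };
static _Atomic(Data *) current = &a;

/* Writer side: the release store guarantees that all writes filling in
 * the new copy *d become visible before the pointer swap does. */
void publish(Data *d) {
    atomic_store_explicit(&current, d, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above, so
 * a reader that sees the new pointer also sees the new contents. */
Data *observe(void) {
    return atomic_load_explicit(&current, memory_order_acquire);
}
```

With a plain (non-atomic) pointer, both the compiler and the CPU are free to produce exactly the reorderings the comment describes.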

But, wait, there's more! :D

With proper use of the get..() and release..() functions, where do you see the problems of accessing deleted data or multiple deletion?

See the following interleaving of reader and writer operations.

Reader                      Shared data               Writer
======                      ===========               ======
                             current = A:0            

data = get4Read()
   var copy = A:0
   copy.readers++;
                             current = A:1
   return A:1
data = A:1
... do the lookup
release4Read(copy == A:1):
    --copy.readers           current = A:0
   0 == copy.readers -> true

                                                      data = get4Write():
                                                           acquire_writelock(update_lock)
                                                           var copy_from = get4Read():
                                                                  var copy = A:0
                                                                  copy.readers++; 
                             current = A:1
                                                                  return A:1
                                                           copy_from == A:1
                                                           var copy_to = deep_copy(A:1);
                                                           copy_to == B:1
                                                           return B:1
                                                      data == B:1
                                                      ... alter the data
                                                      release4Write(data = B:1)
                                                           var copy_from = current;
                                                           copy_from == A:1
                                                           current = B:1
                             current = B:1 
     A:1 != B:1 -> true
     delete A:1
                                                           !!! release4Read(A:1) !!!

And the writer accesses deleted data and then tries to delete it again. Double oops!
