
Lucene index modeling - Why are skiplists used instead of btree?

Original 2021-03-25 17:18:42 6 1 data-structures/ lucene/ skip-lists

I recently started learning Lucene and how it stores and queries indices. Lucene seems to use a skip list as an underlying data structure. However, I could not find any reason to prefer a skip list over a binary tree.

The advantage of skip lists is that they perform well under concurrent use. But Lucene allows a single writer thread per index, and readers read from immutable segments, so that advantage does not apply here. Other than that, a (self-balancing) binary tree trumps a skip list: it guarantees worst-case O(log n) complexity for reads and writes, whereas a skip list offers the same complexity only in the average case. A binary tree would also serve range queries in better time than a skip list. For conjunction queries, Lucene uses the skip lists of multiple postings lists to find their intersection; for this case too, a binary tree would have been enough.

Is there a specific reason skip lists are used in Lucene for indexing that I have missed?

1 Answer

Lucene builds an inverted index using skip lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). See this SO answer: How does lucene index documents?
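To make the conjunction case from the question concrete, here is a minimal sketch of intersecting two sorted postings lists with skip pointers. It is an illustration under stated assumptions, not Lucene's on-disk format: the single-level skips with roughly sqrt(n) spacing and the names `build_skips`, `advance`, and `intersect` are all hypothetical.

```python
import math


def build_skips(postings):
    """Map some indices to a later index (hypothetical sqrt(n) spacing)."""
    n = len(postings)
    step = max(1, int(math.sqrt(n)))
    # Only create a skip when its target index is still inside the list.
    return {i: i + step for i in range(0, n - step, step)}


def advance(postings, skips, i, target):
    """Return the first index >= i whose doc ID is >= target."""
    n = len(postings)
    while i < n and postings[i] < target:
        j = skips.get(i)
        # Doc IDs are strictly increasing, so if postings[j] <= target,
        # nothing between i and j can match and the jump is safe.
        if j is not None and postings[j] <= target:
            i = j
        else:
            i += 1
    return i


def intersect(a, b):
    """Intersect two sorted, duplicate-free lists of doc IDs."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = advance(a, skips_a, i, b[j])
        else:
            j = advance(b, skips_b, j, a[i])
    return out
```

Because each postings list is written once in sorted order and never modified, the skip pointers can be computed at write time and there is nothing to rebalance afterwards.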

That answer also indicates that the primary benefit of using skip lists is that it avoids ever having to rebalance a B-tree. If you'd like to dig deeper, that answer cites another one that provides a lot more detail, Skip List vs. Binary Search Tree, which in turn references additional whitepapers.
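The "no rebalancing" point can be seen in a small in-memory skip list. The sketch below is a generic textbook-style version, not Lucene code; the class names, `MAX_LEVEL = 16`, and the promotion probability of 0.5 are assumptions chosen for illustration.

```python
import random


class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one next-pointer per level


class SkipList:
    MAX_LEVEL = 16  # assumed cap, arbitrary for this sketch

    def __init__(self, seed=None):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 1
        self.rng = random.Random(seed)

    def _random_level(self):
        # Promote a node one level up with probability 0.5 each time.
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def _predecessors(self, key):
        # For each level, find the last node whose key is < key.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.level - 1, -1, -1):
            while node.forward[lvl] is not None and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        return update

    def insert(self, key):
        update = self._predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            return  # already present
        level = self._random_level()
        self.level = max(self.level, level)
        node = Node(key, level)
        # Only the predecessors' pointers change; no rotations, no rebalancing.
        for lvl in range(level):
            node.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = node

    def contains(self, key):
        nxt = self._predecessors(key)[0].forward[0]
        return nxt is not None and nxt.key == key

    def __iter__(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]
```

An insert splices the new node into a few linked lists and leaves every other node untouched, which is the property the linked answers contrast with the restructuring a balanced tree performs.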

Researching this a bit more, there is one other advantage to using skip lists rather than a B-tree. It is not just the rebalancing that is avoided, but also the locking of a portion of the tree while the rebalancing takes place. This aspect is discussed further here. This latter advantage improves concurrency.