
Finding the median in a B+ tree

Original post 2013-05-11 15:55:05 0 3 java/algorithm

I need to implement a B+ tree.

And I need to create the following methods:

Insert(x) - O(log_t(x)).
Search - Successful search - O(log_t(x)). Unsuccessful search - O(1) {with high likelihood}.

So I started by implementing Insert(x). Each time I have a full leaf, I want to split it into two separate leaves: one leaf with the keys equal to or lower than the median key, and a second one containing the keys with a higher value than the median.

How can I find this median without hurting the run time?

I thought about:

Representing each of the internal nodes and leaves as a smaller B+ tree. But then the median is the root (or one of the elements in the root) only when the tree is fully balanced.
Representing each of the internal nodes and leaves as a doubly-linked list, and trying to track the median key while the input is inserted. But there are inputs for which this doesn't work.
Representing them as arrays might give me the middle, but then when I split one up I need at least O(n/2) to insert the keys into a new array.

What can I do?

And about the search, idea-wise: the difference between a successful and an unsuccessful search is in searching the leaves, but I still need to 'run' through the different keys of the tree to determine whether the key is in the tree. So how can it be O(1)?

3 answers

In B+ trees, all the values are stored in the leaves.

Note that you can add a pointer from each leaf to the following leaf, and you get, in addition to the standard B+ tree, an ordered linked list with all elements.

Now, note that assuming you know what the current median in this linked list is, upon insertion/deletion you can cheaply calculate the new median (it can be the same node, the next node or the previous node, no other choices).
Note that modifying this pointer is O(1) (though the insertion/deletion itself is O(log n)).

Given that knowledge, one can cache a pointer to the median element and make sure to maintain it upon deletion/insertion. When you ask for the median, just take it from the cache - O(1).
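To make the pointer maintenance concrete, here is a minimal Java sketch. It is not code from this answer: `MedianList` and its members are hypothetical names, the list stores single keys rather than whole leaves, and the insert position is found by a linear scan purely to keep the example short (in the B+ tree the descent from the root would find it). The cached pointer tracks the lower median, i.e. the item at 0-based index (size - 1) / 2.

```java
final class MedianList {
    private static final class Node {
        final int key;
        Node prev, next;
        Node(int key) { this.key = key; }
    }

    private Node head;    // smallest key
    private Node median;  // cached lower median, index (size - 1) / 2
    private int size;

    void insert(int key) {
        Node node = new Node(key);
        // Find the first node with a strictly larger key. Equal keys go
        // after existing ones, so "key < median.key" below really means
        // "inserted before the median".
        Node before = null, cur = head;
        while (cur != null && cur.key <= key) {
            before = cur;
            cur = cur.next;
        }
        node.prev = before;
        node.next = cur;
        if (before == null) head = node; else before.next = node;
        if (cur != null) cur.prev = node;

        // O(1) fix-up: the median stays put or moves by one node.
        // "size" is still the size before this insertion.
        if (size == 0) {
            median = node;
        } else if (key < median.key) {
            if (size % 2 == 1) median = median.prev;  // inserted before, odd size
        } else {
            if (size % 2 == 0) median = median.next;  // inserted after, even size
        }
        size++;
    }

    int median() {
        if (size == 0) throw new IllegalStateException("empty list");
        return median.key;
    }
}
```

Inserting 5, 3, 8 leaves the cached median at 5; inserting 1 afterwards moves it back to 3, and inserting 4 moves it forward to 4. Deletion would need the mirror-image fix-up, which this sketch leaves out.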

Regarding Unsuccessful search - O(1) {with high likelihood}: this one screams Bloom filters, which are a probabilistic set implementation that never has false negatives (never says something is not in the set while it is), but has some false positives (says something is in the set while in fact it isn't).
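As an illustration of how such a filter could sit in front of the tree, here is a small Java sketch. Everything in it is an assumption made for the example: the class name `SimpleBloomFilter`, the int keys, the bit-mixing function and the double-hashing scheme are not taken from this answer or from any particular library.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Integer bit mixer (the 32-bit MurmurHash3 finalizer).
    private static int mix(int x) {
        x ^= x >>> 16;
        x *= 0x85EBCA6B;
        x ^= x >>> 13;
        x *= 0xC2B2AE35;
        x ^= x >>> 16;
        return x;
    }

    // i-th bit position for a key, derived from two base hashes.
    private int position(int key, int i) {
        int h1 = mix(key);
        int h2 = mix(h1 + 0x9E3779B9) | 1;  // forced odd so it is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Call this on every insert into the tree. */
    void add(int key) {
        for (int i = 0; i < numHashes; i++) bits.set(position(key, i));
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(int key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) return false;
        }
        return true;
    }
}
```

A search would first call `mightContain(x)`. If it returns false, the key is certainly not in the tree and the search ends after a fixed number of bit probes. If it returns true, the normal O(log_t(n)) descent still has to run, which is why the O(1) bound for failed searches holds only with high probability.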

You don't need the median of the B+-tree. You need the median key in the node you're splitting. You have to split at that median to satisfy the condition that each node has N/2 <= n <= N keys. The median key in a node is just the one in the middle, at n/2, where n is the number of actual keys in the node. That's where you split the node. Computing that is O(1): it won't hurt the runtime.
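A minimal Java sketch of that split, assuming the leaf keeps its keys in a sorted int array. `LeafSplit`, `Halves` and `split` are names invented for this example. It keeps the median in the left half, matching the "equal or lower" rule from the question.

```java
import java.util.Arrays;

final class LeafSplit {
    /** The two leaves produced by one split. */
    static final class Halves {
        final int[] left;   // keys <= median
        final int[] right;  // keys > median
        final int median;   // largest key kept in the left leaf

        Halves(int[] left, int[] right, int median) {
            this.left = left;
            this.right = right;
            this.median = median;
        }
    }

    /** Splits a sorted key array with at least two keys at its middle. */
    static Halves split(int[] sortedKeys) {
        if (sortedKeys.length < 2) {
            throw new IllegalArgumentException("need at least two keys to split");
        }
        // Keys are already sorted, so the median is found by index alone.
        int mid = (sortedKeys.length - 1) / 2;
        int[] left = Arrays.copyOfRange(sortedKeys, 0, mid + 1);
        int[] right = Arrays.copyOfRange(sortedKeys, mid + 1, sortedKeys.length);
        return new Halves(left, right, sortedKeys[mid]);
    }
}
```

For keys {1, 3, 5, 7} this yields left {1, 3}, right {5, 7} and median 3. Locating the median is a single index computation; the copy that follows is proportional to the capacity of one node, not to the number of keys in the whole tree.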

You can't get O(1) search failure time from a B+-tree without superimposing another data structure.

I've already posted an answer (and since deleted it), but it's possible I misinterpreted the question, so here's an answer for another interpretation...

What if you need to always know which item is the median in the complete B+ tree container?

As amit says, you can keep a pointer (along with your root pointer) to the current leaf node that contains the median. You can also keep an index into that leaf node. So you get O(1) access by following those directly to the correct node and item.

The issue is maintaining that. Certainly amit is correct that for each insert, the median must either remain the same item or step to the one just before or after. And if you have a linked list through the leaf nodes, that can be handled efficiently even if that means stepping to an adjacent leaf node.

I'm not convinced, though, that it's trivial to determine whether or which way to step, except in the special case where the median and the insert happen to be in the same leaf node.

If you know the size of the complete tree (which you can easily store and maintain with the root pointer), you can at least determine which index the median item should be at both before and after the insert.

However, you need to know whether the previous median item had its index shifted up by the insert, that is, whether the insert point was before or after the median. Unless the insert point and median happen to be in the same node, that's a problem.

Overkill way - augment the B+ tree to support calculating the index of an item and searching by index. The trick for that is that each node keeps a total of the number of items in the leaf nodes of its subtree. That can be pushed up a level so each branch node has an array of subtree sizes along with its array of child node pointers.
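Searching by index with those subtree sizes might look like the following Java sketch. `CountedNode`, `leaf`, `branch` and `select` are hypothetical names for this example, and the tree here is built once by hand; a real implementation would update the sizes along the insert path.

```java
final class CountedNode {
    private final int[] keys;              // set for leaves only
    private final CountedNode[] children;  // set for branch nodes only
    private final int[] subtreeSizes;      // subtreeSizes[i] = items under children[i]

    private CountedNode(int[] keys, CountedNode[] children) {
        this.keys = keys;
        this.children = children;
        int[] sizes = null;
        if (children != null) {
            sizes = new int[children.length];
            for (int i = 0; i < children.length; i++) sizes[i] = children[i].size();
        }
        this.subtreeSizes = sizes;
    }

    static CountedNode leaf(int... keys) {
        return new CountedNode(keys.clone(), null);
    }

    static CountedNode branch(CountedNode... children) {
        return new CountedNode(null, children.clone());
    }

    /** Number of items stored in the leaves below this node. */
    int size() {
        if (children == null) return keys.length;
        int total = 0;
        for (int s : subtreeSizes) total += s;
        return total;
    }

    /** Returns the item at the given 0-based position in sorted order. */
    int select(int index) {
        if (index < 0 || index >= size()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        CountedNode node = this;
        while (node.children != null) {
            int i = 0;
            // Skip whole subtrees that lie entirely before the target.
            while (index >= node.subtreeSizes[i]) {
                index -= node.subtreeSizes[i];
                i++;
            }
            node = node.children[i];
        }
        return node.keys[index];
    }
}
```

With `root = CountedNode.branch(CountedNode.leaf(1, 2), CountedNode.leaf(3, 4, 5), CountedNode.leaf(6, 7))`, the call `root.select((root.size() - 1) / 2)` returns 4, the median of the seven items. Each level costs at most one pass over that node's size array.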

This offers two solutions. You could use the information to determine the index of the insert point as you search, or (provided nodes have parent pointers) you could use it to re-determine the index of the previous median item after the insert.

[Actually three. After inserting, you could just search for the new half-way index based on the new size, without reference to the previous median link.]

In terms of data stored for augmentation, though, this turns out to be overkill. You don't need to know the index of the insert point or of the previous median: you can make do with knowing which side of the median the insert occurred on. If you know the trail to follow from the root to the median item, you should be able to keep track of which side of it you are on as you search for the insert point. So you only need to augment with enough information to find and maintain that trail.