Finding the median in B+ tree

I need to implement a B+ tree.

And I need to create the following methods:

  1. Insert(x) - O(log_t(x)).
  2. Search - Successful search - O(log_t(x)). Unsuccessful search - O(1) {with high likelihood}

So I started by implementing Insert(x). Each time a leaf is full, I want to split it into two separate leaves: one with the keys lower than or equal to the median key, and a second one with the keys higher than the median.

How can I find this median without hurting the run-time?

I thought about:

  1. Representing each internal node and leaf as a smaller B+ tree. But then the median is the root (or one of the elements in the root) only when the tree is fully balanced.
  2. Representing each internal node and leaf as a doubly-linked list, and tracking the median key as the input is inserted. But there are inputs for which this doesn't work.
  3. Representing each as an array, which gives me the middle directly. But then when I split it I need at least O(n/2) to copy the keys into a new array.

What can I do?

And about the search, idea-wise: the difference between a successful and an unsuccessful search lies in searching the leaves, but I still need to 'run' through the different keys of the tree to determine whether the key is in the tree. So how can it be O(1)?

In B+ trees, all the values are stored in the leaves.

Note that you can add a pointer from each leaf to the following leaf; in addition to the standard B+ tree, you then get an ordered linked list of all the elements.

Now, assuming you know what the current median in this linked list is, upon insertion/deletion you can cheaply calculate the new median (it can be the same node, the next node or the previous node, no other choices).
Note that modifying this pointer is O(1) (though the insertion/deletion itself is O(log n)).

Given that knowledge, one can cache a pointer to the median element and make sure to maintain it upon deletion/insertion. When you ask for the median, just take it from the cache: O(1).
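To make the pointer maintenance concrete, here is a minimal Python sketch covering insertion only. It is not a B+ tree: a plain sorted doubly-linked list stands in for the chain of leaf entries, and the linear scan stands in for the tree's O(log n) descent. The names `Node`, `MedianList` and `get_median` are made up for illustration, and "median" is assumed to mean the lower median (index (n-1)//2 in sorted order).

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.prev = None
        self.next = None


class MedianList:
    """Sorted doubly-linked list that caches a pointer to its lower median."""

    def __init__(self):
        self.head = None
        self.median = None
        self.size = 0

    def insert(self, key):
        node = Node(key)
        # Linear scan: a stand-in for the B+ tree's O(log n) descent.
        prev, cur = None, self.head
        while cur is not None and cur.key <= key:
            prev, cur = cur, cur.next
        node.prev, node.next = prev, cur
        if prev is None:
            self.head = node
        else:
            prev.next = node
        if cur is not None:
            cur.prev = node

        # O(1) fix-up: the median stays put or moves by exactly one node.
        if self.size == 0:
            self.median = node
        elif key < self.median.key:
            # New node landed before the median, pushing its index up by one.
            if self.size % 2 == 1:
                self.median = self.median.prev
        else:
            # New node landed after the median, whose index is unchanged.
            if self.size % 2 == 0:
                self.median = self.median.next
        self.size += 1

    def get_median(self):
        return self.median.key
```

Deletion would need the mirror-image fix-up and is left out here. Note that the side of the insert is decided by comparing the new key with the cached median's key, which is one possible way to do it under the assumption that equal keys are always placed after existing ones.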


Regarding "Unsuccessful search - O(1) {with high likelihood}": this one screams Bloom filters, which are a probabilistic set implementation that never has false negatives (never says something is not in the set while it is), but has some false positives (says something is in the set while in fact it isn't).
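A rough Python sketch of the idea, assuming a fixed-size bit array and a handful of hash positions derived from SHA-256. The class name, the `might_contain` method and the default sizes are illustrative choices, not tuned values or part of any particular library:

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # assumed to be below 256 in this sketch
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        data = repr(key).encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

The intended use would be to call `add(x)` on every Insert(x) and to check `might_contain(x)` before descending the tree: a `False` answer ends the search immediately, while a `True` answer still requires the normal tree search. A plain Bloom filter like this one does not support removing keys.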

You don't need the median of the B+-tree. You need the median key in the node you're splitting. You have to split at that median to satisfy the condition that each node has N/2 <= n <= N keys. The median key in a node is just the one in the middle, at index n/2, where n is the number of actual keys in the node. That's where you split the node. Computing that index is O(1): it won't hurt the runtime.
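As a small Python illustration, assuming the node keeps its keys in a sorted list. The function name `split_leaf` is hypothetical, and putting the median into the right-hand leaf is just one convention; the question's wording puts it on the left instead.

```python
def split_leaf(keys):
    """Split a sorted, overfull key list at its middle index.

    Returns (left_keys, right_keys, separator), where the separator is
    the key that would be copied up into the parent node.
    """
    mid = len(keys) // 2                    # O(1): no search for the median
    left, right = keys[:mid], keys[mid:]    # copies about half the keys each
    separator = right[0]                    # the median key itself
    return left, right, separator
```

Locating the median is constant time. The slices do copy keys, but at most one node's worth, so that cost is bounded by the node capacity rather than by the number of keys in the whole tree.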

You can't get O(1) search failure time from a B+-tree without superimposing another data structure.

I've already posted an answer (and since deleted it), but it's possible I've misinterpreted, so here's an answer for another interpretation...

What if you always need to know which item is the median of the complete B+ tree container?

As amit says, you can keep a pointer (along with your root pointer) to the current leaf node that contains the median. You can also keep an index into that leaf node. So you get O(1) access by following those directly to the correct node and item.

The issue is maintaining that. Certainly amit is correct that for each insert, the median must also remain the same item, or must step to the one just before or after. And if you have a linked list through the leaf nodes, that can be handled efficiently even if that means stepping to an adjacent leaf node.

I'm not convinced, though, that it's trivial to determine whether or which way to step, except in the special case where the median and the insert happen to be in the same leaf node.

If you know the size of the complete tree (which you can easily store and maintain with the root pointer), you can at least determine which index the median item should be at both before and after the insert.

However, you need to know whether the previous median item had its index shifted up by the insert, that is, whether the insert point was before or after the median. Unless the insert point and median happen to be in the same node, that's a problem.

Overkill way - augment the B+ tree to support calculating the index of an item and searching for indexes. The trick for that is that each node keeps a total of the number of items in the leaf nodes of its subtree. That can be pushed up a level so each branch node has an array of subtree sizes along with its array of child node pointers.
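A stripped-down Python sketch of searching by index with such counts. The `Leaf` and `Branch` classes and the `select` function are invented for this example, separator keys are left out entirely, and the tree is built by hand rather than through inserts:

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.count = len(keys)


class Branch:
    def __init__(self, children):
        self.children = children
        # One subtree size per child, kept alongside the child pointers.
        self.counts = [child.count for child in children]
        self.count = sum(self.counts)


def select(node, k):
    """Return the k-th smallest key (0-based) stored under node."""
    if not 0 <= k < node.count:
        raise IndexError("index out of range")
    while isinstance(node, Branch):
        for child, size in zip(node.children, node.counts):
            if k < size:
                node = child   # the k-th item lives in this subtree
                break
            k -= size          # skip this whole subtree
    return node.keys[k]


tree = Branch([
    Branch([Leaf([1, 2]), Leaf([3, 4, 5])]),
    Branch([Leaf([6]), Leaf([7, 8])]),
])
lower_median = select(tree, (tree.count - 1) // 2)
```

In a real tree, every insert, delete and split would also have to update the counts along the path back to the root, which is the maintenance cost of this augmentation.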

This offers two solutions. You could use the information to determine the index for the insert point as you search, or (providing nodes have parent pointers) you could use it to re-determine the index of the previous median item after the insert.

[Actually three. After inserting, you could just search for the new half-way index based on the new size without reference to the previous median link.]

In terms of data stored for augmentation, though, this turns out to be overkill. You don't need to know the index of the insert point or the previous median. You can make do with knowing which side of the median the insert occurred on. If you know the trail to follow from the root to the median item, you should be able to keep track of which side of it you are on as you search for the insert point. So you only need to augment with enough information to find and maintain that trail.


 