
B+ tree with variable length keys

In a common implementation of a B+ tree, we can assume that keys have a fixed length (e.g. 25 bytes). We can then require that each node hold at least a minimum and at most a maximum number of keys.

If I wanted the tree to accept variable-length keys, what should I modify? What if I say that a node must have at least 2 keys, but the key I'm trying to insert is so big that it doesn't fit into the block that holds the node?

The simple solution is to store the keys as pointers (wrapped in a type that overloads the relational operators etc.) rather than as values, but that of course damages the locality that is part of the point of using B+ trees.

That said, the larger the items, the less it matters that items are adjacent in memory. Huge items won't fit even one to a cache line, let alone several in the same line.
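As a rough illustration of the pointer approach, here is a minimal sketch. The name `KeyRef` is made up for this example, and `std::string` is used as the heap payload only for brevity. The node would store one `KeyRef` per slot, so every slot has the same size regardless of key length, and each comparison costs one extra pointer dereference.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical wrapper (not from the original answer): the node stores one
// pointer per key, so every slot has the same size however long the key is.
class KeyRef {
public:
    explicit KeyRef(std::string key)
        : key_(std::make_unique<std::string>(std::move(key))) {}

    const std::string& value() const { return *key_; }

    // Comparisons look through the pointer and compare the key bytes.
    friend bool operator<(const KeyRef& a, const KeyRef& b) {
        return *a.key_ < *b.key_;
    }
    friend bool operator==(const KeyRef& a, const KeyRef& b) {
        return *a.key_ == *b.key_;
    }

private:
    std::unique_ptr<std::string> key_;  // the key itself lives on the heap
};
```

Because the type is move-only, shifting keys inside a node or moving them between nodes during a split only moves pointers, never the key bytes.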

Another relatively simple approach is to use a union type, placement new, or something similar to construct items inside a fixed-size storage type that is big enough for every item type you might use. You still have a fixed number of bytes per item, but the items don't necessarily use all those bytes.
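A minimal sketch of the fixed-bytes-per-item idea follows. It swaps the union or placement new for a simpler length-prefixed byte buffer, which is enough when the keys are plain byte strings. `KeySlot` and `MaxLen` are names invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical fixed-size slot: MaxLen bytes are always reserved, but a key
// may use fewer of them. The length is stored in a single byte.
template <std::size_t MaxLen>
struct KeySlot {
    static_assert(MaxLen < 256, "length must fit in one byte");

    unsigned char len = 0;
    char bytes[MaxLen] = {};

    // Copies the key into the slot. Returns false, leaving the slot
    // unchanged, if the key is longer than the reserved space.
    bool assign(std::string_view key) {
        if (key.size() > MaxLen) return false;
        len = static_cast<unsigned char>(key.size());
        std::copy(key.begin(), key.end(), bytes);
        return true;
    }

    // The bytes actually in use.
    std::string_view view() const { return std::string_view(bytes, len); }
};
```

This keeps the usual min/max key counts per node valid, at the cost of wasted space for short keys and a hard upper bound on key length.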

If you're willing to do the work, you could have variable-sized nodes. You'll have some hassles working with those nodes, of course, depending on how you arrange the in-node data structure to cope with that. You might have a small array of item pointers within the node, for instance, pointing to the items, which are also stored inside the node (not separately allocated on the heap).

Also, every time you change a node you may need to reallocate it. Even if all you're doing is rebalancing, that might move a huge item from one node into another, and even though the destination node has room in the sense of having a free slot for an item, it may not have enough bytes to store the value.

In a sense, each node would be a mini-heap in which you can allocate or release space for items big or small, but sometimes you'd have to go back to the heap proper to replace that mini-heap with a bigger or smaller one.

It's again worth mentioning that if the items are that huge, locality within a node probably isn't relevant anyway.

I've implemented B+-style multiway trees in memory myself before, but I've never gone to this extreme.

You can keep the remainder of a large key in overflow pages, as is done there.
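A minimal sketch of the overflow idea, with everything hypothetical: `OverflowStore` is an in-memory stand-in for overflow pages, and the 16-byte inline prefix (`kInline`) is an arbitrary choice for the example. Each entry in the node has a fixed size; only the tail of a long key is stored elsewhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for overflow pages on disk.
struct OverflowStore {
    std::vector<std::string> pages;

    // Stores the tail of a key and returns the index of its page.
    std::uint32_t put(std::string_view rest) {
        pages.emplace_back(rest);
        return static_cast<std::uint32_t>(pages.size() - 1);
    }
};

constexpr std::size_t kInline = 16;               // assumed inline prefix size
constexpr std::uint32_t kNoOverflow = 0xFFFFFFFF; // marks "no overflow page"

// Fixed-size entry: the first kInline bytes of the key plus a link to the rest.
struct KeyEntry {
    char prefix[kInline] = {};
    std::uint32_t total_len = 0;
    std::uint32_t overflow = kNoOverflow;
};

// Splits the key into an inline prefix and, if needed, an overflow tail.
KeyEntry store_key(std::string_view key, OverflowStore& ov) {
    KeyEntry e;
    e.total_len = static_cast<std::uint32_t>(key.size());
    const std::size_t n = std::min(key.size(), kInline);
    std::copy(key.begin(), key.begin() + n, e.prefix);
    if (key.size() > kInline) e.overflow = ov.put(key.substr(kInline));
    return e;
}

// Reassembles the full key from the prefix and the overflow tail.
std::string load_key(const KeyEntry& e, const OverflowStore& ov) {
    const std::size_t n = std::min<std::size_t>(e.total_len, kInline);
    std::string key(e.prefix, n);
    if (e.overflow != kNoOverflow) key += ov.pages[e.overflow];
    return key;
}
```

With this layout many comparisons can be decided from the inline prefix alone, and the overflow page only needs to be read when two prefixes are equal.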

Use hashing. A hash is a fixed-size representation of a key. For good hashing functions see http://www.cse.yorku.ca/~oz/hash.html .
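For illustration, here is a sketch of djb2, one of the functions listed on that page, widened to 64 bits for this example. Two caveats apply to any hash used as a tree key: hash order is unrelated to key order, so range scans over the original keys no longer work, and distinct keys can collide, so the full key still has to be stored and compared somewhere.

```cpp
#include <cstdint>
#include <string_view>

// djb2 string hash: start at 5381, then hash = hash * 33 + c for each byte.
// The 64-bit width is a choice made for this sketch.
std::uint64_t djb2(std::string_view key) {
    std::uint64_t hash = 5381;
    for (unsigned char c : key) hash = hash * 33 + c;
    return hash;
}
```

The tree would then be keyed on the fixed-size `djb2(key)` value, with the variable-length key kept alongside the record to resolve collisions.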

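Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.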

 