
How to implement a string key in a B+ tree?

Original post: 2010-12-15 00:27:00 · Tags: c, b-tree

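Many B+ tree examples are implemented with integer keys, but I have also seen some examples that use both integer keys and string keys. I understand the basics of B+ trees, but I don't understand how string keys work.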

2 answers

I also use a multi-level B-tree. A string such as "test" can be seen as an array [t, e, s, t]. Now think about a tree of trees: each node holds only one character for a given position. You also need to think about a key/value array implementation for the nodes, such as a growing linked list of arrays, trees, or whatever suits you. This also lets the node size be dynamic (the number of distinct letters is limited).

If all keys fit in the leaf, you store them in the leaf. If the leaf gets too big, you can add new nodes.

Since each node now knows its letter and position, you can strip those characters from the keys in the leaf and reconstruct them as you search, or whenever you know the leaf and the position within the leaf.

If, after you have created the tree, you write it out in a suitable format, you end up with string compression: each letter combination (prefix) is stored only once, even if it is shared by thousands of strings.
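The prefix sharing described above can be sketched as a plain character trie. This is only an illustration under assumptions, not the answerer's actual multi-level structure: the names `TrieNode`, `trie_insert`, `trie_contains`, and `trie_count_nodes` are hypothetical, children are kept in a simple linked list, and nodes are never freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical trie node: one character per position in the key. */
typedef struct TrieNode {
    char ch;                   /* character stored at this position */
    bool is_key;               /* true if some key ends at this node */
    struct TrieNode *child;    /* first node for the next position */
    struct TrieNode *sibling;  /* next alternative for the same position */
} TrieNode;

/* Allocate a zeroed node holding ch (use '\0' for the root). */
TrieNode *trie_new_node(char ch) {
    TrieNode *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->ch = ch;
    return n;
}

/* Find the child of parent holding ch; optionally create it if missing. */
TrieNode *trie_child(TrieNode *parent, char ch, bool create) {
    TrieNode *c;
    for (c = parent->child; c != NULL; c = c->sibling)
        if (c->ch == ch)
            return c;
    if (!create)
        return NULL;
    c = trie_new_node(ch);
    c->sibling = parent->child;  /* push onto the front of the child list */
    parent->child = c;
    return c;
}

/* Insert key; characters of an already stored prefix are reused. */
void trie_insert(TrieNode *root, const char *key) {
    TrieNode *n = root;
    for (; *key != '\0'; key++)
        n = trie_child(n, *key, true);
    n->is_key = true;
}

/* Exact-match lookup. */
bool trie_contains(TrieNode *root, const char *key) {
    TrieNode *n = root;
    for (; *key != '\0' && n != NULL; key++)
        n = trie_child(n, *key, false);
    return n != NULL && n->is_key;
}

/* Count the nodes below n, i.e. the characters actually stored. */
size_t trie_count_nodes(const TrieNode *n) {
    size_t total = 0;
    for (const TrieNode *c = n->child; c != NULL; c = c->sibling)
        total += 1 + trie_count_nodes(c);
    return total;
}
```

Inserting "test", "team", and "tea" into an empty trie stores 6 character nodes instead of the 11 characters of the raw strings, because the prefix "te" is kept only once.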

Simple compression of this kind often gives about 1:10 compression for normal text (in any language!) and about 1:4 in memory. You can also still search for any given word (the words being the strings in the dictionary you built the B+ tree for).

This is one extreme where you can use multiple levels.

Databases usually use some form of prefix tree (index the first x characters, store the rest in the leaves, and use binary search within the leaf). There are also implementations that use variable prefix lengths based on the actual key density. So in the end it is very implementation-specific, and a lot of options exist.
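As a rough sketch of the prefix-plus-binary-search idea, assuming a fixed prefix length and a single inner level: the inner node compares only the first `PREFIX_LEN` characters with `strncmp`, and the leaf does a binary search over full strings with `strcmp`. The types `Inner` and `Leaf`, the constants, and both function names are hypothetical, and the sketch assumes that all keys sharing the same first `PREFIX_LEN` characters live in the same leaf.

```c
#include <assert.h>
#include <string.h>

#define PREFIX_LEN 4   /* assumed number of characters kept in inner nodes */
#define NODE_CAP   8   /* assumed maximum number of keys per node */

/* Hypothetical leaf: full string keys, sorted by strcmp. */
typedef struct {
    int count;
    const char *keys[NODE_CAP];
    int values[NODE_CAP];
} Leaf;

/* Hypothetical inner node: truncated separators, NUL-padded if short. */
typedef struct {
    int count;                        /* number of separators in use */
    char seps[NODE_CAP][PREFIX_LEN];  /* sorted separator prefixes */
    Leaf *children[NODE_CAP + 1];
} Inner;

/* Pick the child whose range covers key, looking only at its prefix.
 * Child i holds keys whose prefix is >= seps[i-1] and < seps[i]. */
Leaf *inner_route(const Inner *node, const char *key) {
    int i = 0;
    while (i < node->count &&
           strncmp(key, node->seps[i], PREFIX_LEN) >= 0)
        i++;
    return node->children[i];
}

/* Binary search inside the leaf; returns the key's index or -1. */
int leaf_find(const Leaf *leaf, const char *key) {
    assert(leaf != NULL && key != NULL);
    int lo = 0, hi = leaf->count - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, leaf->keys[mid]);
        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

Because the comparison is ordinary lexicographic byte order, the keys stay sorted across leaves, so range scans still work with this layout.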

If the tree only needs to help find an exact string, combining the length with a hash of the lower bits of each character often does the trick. For example, you could build a hash code from the length (8 bits) + 4 bits × 6 characters = 32 bits. Or you can use the first, last, and middle characters along with the length. Since the length is one of the most selective attributes, you won't find many collisions while searching for your string.
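One possible reading of that 32-bit layout, as a minimal sketch: the top 8 bits hold the length and the remaining 24 bits hold the low 4 bits of six characters. The answer does not say which six characters to use or how to treat long strings, so taking the first six and clamping the length to 255 are assumptions, and `string_key_hash` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a 32-bit integer key: [length:8][c0:4][c1:4][c2:4][c3:4][c4:4][c5:4].
 * Assumptions: first six characters are used, missing ones count as 0,
 * and lengths above 255 are clamped to 255. */
uint32_t string_key_hash(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    uint32_t h = (uint32_t)(len > 255 ? 255 : len) << 24;
    for (int i = 0; i < 6 && (size_t)i < len; i++) {
        uint32_t nibble = (unsigned char)s[i] & 0x0Fu;  /* low 4 bits */
        h |= nibble << (20 - 4 * i);
    }
    return h;
}
```

For "test" this gives `0x04453400` (length 4, then the nibbles 4, 5, 3, 4). Different strings can map to the same code, for example "dEcD", so the full string still has to be compared after the integer key matches.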

This solution is very good for finding a particular string, but it destroys the natural order of the strings, so you have no chance of answering range queries and the like. For cases where you search for a particular username, email, or address, such a tree would be superior (though the question then is why not use a hash map).

The string key can be a pointer to a string (very likely).

Or the key could be sized to fit most strings. 64 bits holds 8-byte strings, and even 16-byte keys aren't too ridiculous.
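A minimal sketch of the fixed-size option, assuming a 16-byte key stored inline in the node and padded with NUL bytes: the names `InlineKey`, `inline_key_set`, and `inline_key_cmp` are hypothetical. With the pointer option, the node would hold a `const char *` instead and compare with `strcmp`.

```c
#include <assert.h>
#include <string.h>

#define KEY_BYTES 16  /* assumed inline key size */

/* Hypothetical inline key: NUL-padded, not necessarily NUL-terminated. */
typedef struct {
    char bytes[KEY_BYTES];
} InlineKey;

/* Copy s into the key; returns 1 on success, 0 if s does not fit. */
int inline_key_set(InlineKey *k, const char *s) {
    assert(k != NULL && s != NULL);
    size_t len = strlen(s);
    if (len > KEY_BYTES)
        return 0;
    memset(k->bytes, 0, KEY_BYTES);
    memcpy(k->bytes, s, len);
    return 1;
}

/* Order keys byte by byte; for NUL-padded keys this matches strcmp order. */
int inline_key_cmp(const InlineKey *a, const InlineKey *b) {
    return memcmp(a->bytes, b->bytes, KEY_BYTES);
}
```

Inline keys avoid a pointer dereference per comparison and keep the node contiguous, at the cost of wasted space for short strings and a hard limit on key length; strings that do not fit need a fallback such as the pointer approach.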

Choosing a key really depends on how you plan to use it.