
How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

If I have to choose between a hash table and a prefix tree, what are the discriminating factors that would lead me to choose one over the other? From my own naive point of view, it seems that using a trie has some extra overhead since it isn't stored as an array, but that in terms of run time (assuming the longest key is the longest English word) it can be essentially O(1) relative to that upper bound. Maybe the longest English word is 50 characters?

Hash tables give instant lookup once you have the index. Hashing the key to get the index, however, seems like it could easily take nearly 50 steps.

Can someone provide me with a more experienced perspective on this? Thanks!

Advantages of tries:

The basics:

  • Predictable O(k) lookup time, where k is the size of the key
  • A lookup for a key that isn't there can stop early, in fewer than k steps
  • Supports ordered traversal
  • No need for a hash function
  • Deletion is straightforward

New operations:

  • You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
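As a rough illustration of the operations listed above, here is a minimal dictionary-based trie sketch. The class and method names (`Trie`, `contains`, `with_prefix`) are made up for this example and do not come from any particular library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_key = False   # True if a stored key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def _find(self, prefix):
        # Walk down one node per character; stop early on a miss.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, key):
        node = self._find(key)
        return node is not None and node.is_key

    def with_prefix(self, prefix):
        """Return every stored key that starts with prefix, in sorted order."""
        start = self._find(prefix)
        if start is None:
            return []
        results = []
        stack = [(start, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_key:
                results.append(word)
            # Push children in reverse order so the smallest character pops first.
            for ch in sorted(node.children, reverse=True):
                stack.append((node.children[ch], word + ch))
        return results
```

With the keys "a", "apple", "axe", "axle" and "bat" inserted, `with_prefix("ax")` walks two nodes to reach the "ax" subtree and then enumerates only that subtree. A hash table would have to scan every key to answer the same query.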

Advantages of linked structure:

  • If there are many common prefixes, the space they require is shared.
  • Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that differs only along one branch and points into the old trie everywhere else. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
  • An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
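The structure sharing described above can be sketched with path copying: an insert copies only the nodes along the key's path and reuses every other subtree. This is a simplified illustration, not a production persistent map; the names `Node`, `insert` and `lookup` are hypothetical, and a stored value of `None` is indistinguishable from a missing key in this sketch.

```python
class Node:
    __slots__ = ("children", "value")

    def __init__(self, children, value=None):
        self.children = children  # char -> Node; never mutated after construction
        self.value = value


def insert(node, key, value, i=0):
    """Return the root of a new trie; the trie rooted at `node` is left untouched."""
    # Copy only this node's child map; the child nodes themselves are shared.
    children = dict(node.children) if node is not None else {}
    if i == len(key):
        return Node(children, value)
    ch = key[i]
    children[ch] = insert(children.get(ch), key, value, i + 1)
    return Node(children, node.value if node is not None else None)


def lookup(node, key):
    for ch in key:
        if node is None:
            return None
        node = node.children.get(ch)
    return node.value if node is not None else None
```

After `v1 = insert(insert(None, "car", 1), "dog", 2)` and `v2 = insert(v1, "cat", 3)`, both versions remain usable: `v1` does not contain "cat", and the whole "dog" branch is the same object in both tries.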

Advantages of hashtables:

  • Everyone knows hashtables, right? Your system will already have a nice, well-optimized implementation, faster than tries for most purposes.
  • Your keys need not have any special structure.
  • More space-efficient than the obvious linked trie structure

It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.

Everyone knows hash tables and their uses, but lookup time is not exactly constant: it depends on how big the hash table is and on the computational cost of the hash function.

Creating huge hash tables for efficient lookup is not an elegant solution in most industrial scenarios where even small latency or scalability costs matter (e.g., high-frequency trading). You also have to care about how much space the data structure takes up in memory, in order to reduce cache misses.

A very good example of where a trie better suits the requirements is messaging middleware. You have a million subscribers and publishers of messages in various categories (in JMS terms, topics or exchanges). If you want to filter messages by topic (topics are just strings), you definitely do not want to create a hash table holding a million subscriptions across a million topics. A better approach is to store the topics in a trie, so that the cost of a topic match is independent of the number of topics, subscriptions, and publishers, and depends only on the length of the topic string. I like it because you can be creative with this data structure to optimize space requirements and hence get fewer cache misses.
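A minimal sketch of such a topic trie follows. It makes two assumptions that the answer does not state: topics are dot-separated (for example "sports.football"), and a subscription to a topic also receives messages published to its sub-topics. The names `TopicTrie`, `subscribe` and `match` are hypothetical.

```python
class TopicNode:
    def __init__(self):
        self.children = {}        # topic segment -> TopicNode
        self.subscribers = set()  # subscribers registered at exactly this topic


class TopicTrie:
    def __init__(self, sep="."):
        self.root = TopicNode()
        self.sep = sep  # assumed segment separator

    def subscribe(self, topic, subscriber):
        node = self.root
        for part in topic.split(self.sep):
            node = node.children.setdefault(part, TopicNode())
        node.subscribers.add(subscriber)

    def match(self, topic):
        """Collect subscribers of `topic` and of each of its ancestor topics."""
        found = set()
        node = self.root
        for part in topic.split(self.sep):
            node = node.children.get(part)
            if node is None:
                break
            found |= node.subscribers
        return found
```

Matching "sports.football.scores" visits at most three nodes no matter how many other topics are stored, which is the "depends only on the length of the topic string" property described above.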

Use a trie if:

  1. You need an autocomplete feature.
  2. You need to find all words beginning with 'a' or 'axe' and so on.
  3. You need suffix queries. A suffix tree is a special form of trie, and suffix trees have a whole list of advantages that hashing cannot cover.

There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or, equivalently, in chars).

This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function has to use all the bits of the key, so hashing "farm animals" should take about three times as long as hashing "farm" (12 characters versus 4), unless you're in some sort of rolling-hash scenario, and there are somewhat similar operation-saving scenarios with tries too. With a vanilla trie, it's clear why inserting "farm animals" takes proportionally longer than inserting just "farm". In the long run this is true of compressed tries as well.
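To make the point concrete, here is a small sketch using 32-bit FNV-1a, chosen only as a simple, well-known hash that reads every byte; the answer does not name a specific hash function. The step counter is added purely for illustration.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime


def fnv1a_32(data: bytes):
    """Return (hash, steps), where steps counts the bytes processed."""
    h = FNV_OFFSET_BASIS
    steps = 0
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the result in 32 bits
        steps += 1
    return h, steps


short_hash, short_steps = fnv1a_32(b"farm")
long_hash, long_steps = fnv1a_32(b"farm animals")
```

Here `short_steps` is 4 and `long_steps` is 12: the hash does one round of work per byte of the key, just as a trie walk visits one node per character.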

A HashTable implementation is space-efficient compared to a basic Trie implementation. But with strings, ordering is needed in many practical applications, and a HashTable completely destroys lexicographical order. If your application performs operations based on lexicographical order (such as partial search, all strings with a given prefix, or all words in sorted order), you should use a Trie. For lookup only, a HashTable should be used (arguably, it gives the minimum lookup time).

PS: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Their lookup time is higher than a HashTable's, but they are time-efficient in all other operations. They are also more space-efficient than tries.
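For readers who have not met the structure, a ternary search tree stores one character per node with three links: lower, equal, and higher. The sketch below is a minimal, unbalanced version supporting only insert and exact lookup; the class and field names are chosen for this example, and empty keys are not supported.

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "is_key")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_key = False  # True if a stored key ends at this node


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, key):
        if not key:
            raise ValueError("empty keys are not supported in this sketch")
        self.root = self._insert(self.root, key, 0)

    def _insert(self, node, key, i):
        ch = key[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, key, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, key, i)
        elif i + 1 < len(key):
            node.eq = self._insert(node.eq, key, i + 1)  # matched; go to next char
        else:
            node.is_key = True
        return node

    def contains(self, key):
        if not key:
            return False
        node, i = self.root, 0
        while node is not None:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(key):
                node = node.eq
                i += 1
            else:
                return node.is_key
        return False
```

Each node holds three pointers instead of a per-character child table, which is where the space saving over a basic trie comes from. The price is extra comparisons along the lower and higher links during lookup.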

Insertion and lookup on a trie are linear in the length s of the input string: O(s).

A hash table gives you O(1) lookup and insertion, but first you have to compute the hash of the input string, which again is O(s).

In conclusion, the asymptotic time complexity is linear in both cases.

The trie has somewhat more memory overhead, but you can choose a compressed trie, which puts you more or less back on a tie with the hash table.

To break the tie, ask yourself this question: do I need to look up full words only, or do I need to return all words matching a prefix (as in a predictive text input system)? For the first case, go for a hash table. It is simpler and cleaner code, and easier to test and maintain. For a more elaborate use case where prefixes or suffixes matter, go for a trie.

And if you are doing it just for fun, implementing a trie would put a Sunday afternoon to good use.

Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie's execution time varies with the data.

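Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original source address. For any questions, contact: yoyou2525@163.com.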
