
How should I go about optimizing a hash table for a given population?

Original post: 2010-10-11 13:51:16 2 9 java

Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).

EDIT: I want to optimize lookup. I don't care how long it takes to build.

9 answers

I would make sure that your keys hash to unique values. This will ensure that every lookup takes constant time and is thus as fast as possible.

Since you can never have more than 100,000 keys, it is entirely possible to have 100,000 hash values.

Also, make sure that you use the constructor that takes an int to specify the initial capacity (set it to 100,000) and a float to set the load factor (use 1). Doing this requires that you have a perfect hash function for your keys, but it will result in the fastest possible lookup in the least amount of memory.
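To make the constructor advice concrete, here is a minimal sketch. It assumes the answer means java.util.HashMap, whose two-argument constructor takes an initial capacity and a load factor; the class name PresizedMapDemo and the "key-N" keys are made up for illustration. Note that HashMap rounds the requested capacity up to a power of two, so the table ends up somewhat larger than 100,000 slots, and a load factor of 1 only avoids resizing; it does not by itself prevent collisions.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizedMapDemo {
    // Builds a map sized up front for a known, fixed population so that
    // no rehashing is needed while it is being filled.
    static Map<String, Integer> build(int population) {
        Map<String, Integer> map = new HashMap<>(population, 1.0f);
        for (int i = 0; i < population; i++) {
            map.put("key-" + i, i); // hypothetical keys, for illustration only
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = build(100_000);
        System.out.println(map.size());
        System.out.println(map.get("key-42"));
    }
}
```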

In general, to optimize a hash table, you want to minimize collisions in the computation of your hash, so that your buckets won't contain more than one item and the hash search will return immediately.

Most of the time, that means that you should measure the output of your hash function on the problem space. So I guess I'd recommend looking into that.
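One way to do that measurement, as a rough sketch: run every key in the fixed population through hashCode, reduce it to a slot for a candidate table size, and look at the fullest bucket. The class BucketStats and its method are hypothetical names, and the plain modulo reduction is a simplifying assumption rather than what any particular library does internally.

```java
import java.util.List;

final class BucketStats {
    // Returns the size of the fullest bucket when the keys are spread over
    // tableSize slots. A result of 1 means there are no collisions at all.
    static int maxBucketSize(List<?> keys, int tableSize) {
        int[] counts = new int[tableSize];
        int max = 0;
        for (Object key : keys) {
            // floorMod keeps the slot non-negative even for negative hash codes.
            int slot = Math.floorMod(key.hashCode(), tableSize);
            counts[slot]++;
            max = Math.max(max, counts[slot]);
        }
        return max;
    }
}
```

Running this over the real key set for a few table sizes shows quickly whether the existing hashCode is good enough or needs work.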

Ensure there are no collisions. If there are no collisions, you are guaranteed O(1) constant look-up time. The next optimization would then be the look-up itself.

Use a profiler to optimize piece by piece. It's hard to do without one.

If it's possible to make a hash table large enough that there are no collisions at all, that would be ideal, since your insertions and lookups will be done in constant time.

But if that is not possible, try to choose a hash function such that your keys get distributed uniformly across the hash table.

Perfect hashing algorithms deal with the problem, but may not scale to 100k objects. I found a Java MPH package, but haven't tried it.

If the population is known at compile time, then the optimal solution is to use a minimal perfect hash function (MPH). The Wikipedia page on this subject links to several Java tools that can generate these.
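To illustrate what a minimal perfect hash buys you, here is a toy sketch that brute-forces a seed until every key lands in its own slot of a table with exactly as many slots as keys. This is not one of the algorithms the tools mentioned above use, and it only works for small key sets; the class SeededPerfectHash, the mixing constants, and the maxSeeds limit are all assumptions made for the example.

```java
final class SeededPerfectHash {
    final int seed;
    final String[] slots; // slots[i] holds the single key assigned to slot i

    private SeededPerfectHash(int seed, String[] slots) {
        this.seed = seed;
        this.slots = slots;
    }

    // Mixes the key's hashCode with a seed and reduces it to a slot index.
    static int slot(String key, int seed, int tableSize) {
        int h = key.hashCode() ^ seed;
        h *= 0x9E3779B1;
        h ^= h >>> 15;
        h *= 0x85EBCA77;
        h ^= h >>> 13;
        return Math.floorMod(h, tableSize);
    }

    // Tries seeds 0, 1, 2, ... until no two keys share a slot.
    // Keys must be distinct; the table has exactly keys.length slots.
    static SeededPerfectHash build(String[] keys, int maxSeeds) {
        int n = keys.length;
        for (int seed = 0; seed < maxSeeds; seed++) {
            String[] slots = new String[n];
            boolean collisionFree = true;
            for (String key : keys) {
                int s = slot(key, seed, n);
                if (slots[s] != null) {
                    collisionFree = false;
                    break;
                }
                slots[s] = key;
            }
            if (collisionFree) {
                return new SeededPerfectHash(seed, slots);
            }
        }
        throw new IllegalStateException("no collision-free seed found");
    }

    // One hash, one comparison: the comparison is still needed because
    // keys outside the original population also map to some slot.
    boolean contains(String key) {
        return key.equals(slots[slot(key, seed, slots.length)]);
    }
}
```

The number of seeds to try grows very quickly with the number of keys, which is why real generators use smarter constructions for populations in the 100,000 range.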

The optimization must be done in the hashCode method of the key class. The thing to keep in mind is to implement this method so as to avoid collisions.

Getting the perfect hashing algorithm to give totally unique values to 100K objects is likely to be close to impossible. Consider the birthday paradox. The date on which people are born can be considered a perfect hashing algorithm, yet if you have 23 or more people you are more likely than not to have a collision, and that is in a table of 365 dates.

So how big a table will you need to have no collisions in 100K?
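The birthday figures can be checked with a few lines of arithmetic. The sketch below assumes the hash behaves like a uniformly random function; it says nothing about hash functions constructed specifically for a known key set. The class and method names are made up for the example.

```java
final class BirthdayBound {
    // Probability that n keys, each placed uniformly at random into one of
    // m slots, produce at least one collision.
    static double collisionProbability(int n, long m) {
        if (n > m) {
            return 1.0; // pigeonhole: more keys than slots
        }
        double noCollision = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th key must avoid the i slots already taken.
            noCollision *= (double) (m - i) / m;
        }
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(23, 365));            // just over 0.5
        System.out.println(collisionProbability(100_000, 1L << 32));  // roughly 0.69
    }
}
```

Under the same random-hash assumption, the table size at which 100,000 keys have an even chance of no collision is about n²/(2 ln 2), on the order of 7 billion slots.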

If your keys are strings, your optimal strategy is a tree, not binary but with n branches at each character. If the keys are lower-case only, it is easier still, as you need just 26 branches whenever you create a node.

We start with 26 branches at the root. Follow the first character, say f. f might have a value associated with it, and it may have sub-trees. Look up the sub-tree for o. This leads to more sub-trees; then look up the next o. (You knew where that was leading!) If this doesn't have a value associated with it, or we hit a null sub-tree on the way, we know the value is not found.

You can optimise the space in the tree at the point where a key becomes unique. Say you have a key january and it becomes unique at the 4th character. At this point, where you assign the value, you also store the actual string associated with it. In our example there may be one value associated with foo, but the key it relates to may be food, not foo.

I think Google's search engine uses a technique similar to this.
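The tree described in this answer is usually called a trie. A minimal sketch of the plain 26-branch version follows, without the space optimisation for unique suffixes; it assumes keys contain only the characters a to z, and the class name LowercaseTrie is made up for the example.

```java
final class LowercaseTrie<V> {
    private static final class Node<V> {
        @SuppressWarnings("unchecked")
        final Node<V>[] children = (Node<V>[]) new Node[26]; // one branch per letter
        V value; // null when no key ends at this node
    }

    private final Node<V> root = new Node<>();

    // Walks down one branch per character, creating nodes as needed.
    // Assumes the key consists only of the characters 'a'..'z'.
    void put(String key, V value) {
        Node<V> node = root;
        for (int i = 0; i < key.length(); i++) {
            int branch = key.charAt(i) - 'a';
            if (node.children[branch] == null) {
                node.children[branch] = new Node<>();
            }
            node = node.children[branch];
        }
        node.value = value;
    }

    // Returns the stored value, or null if we hit a null sub-tree on the way
    // or the final node has no value associated with it.
    V get(String key) {
        Node<V> node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children[key.charAt(i) - 'a'];
            if (node == null) {
                return null;
            }
        }
        return node.value;
    }
}
```

Lookup cost here depends on the key length rather than on the number of stored pairs, at the price of 26 references per node.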

The key question is what your key is. (No pun intended.) As others have pointed out, the goal is to minimize the number of hash collisions. If you can get the number of hash collisions to zero, i.e. your hash function generates a unique value for every key that is actually passed to it, you will have a perfect hash.

Note that in Java, a hash function really has two steps. First, the key is run through the hashCode function for its class. Then we calculate an index into the hash table by taking this value modulo the size of the hash table.

I think that people discussing the perfect hash function tend to forget that second step. Even if you wrote a hashCode function that generated a unique value for every key passed to it, you could still get an absolutely terrible hash if this value modulo the hash table size is not unique. For example, say you have 100 keys and your hashCode function returns the values 1, 1001, 2001, 3001, 4001, 5001, ... 99001. If your hash table has 100,000 slots, this would be a perfect hash. Every key gets its own slot. But if it has 1000 slots, they all hash to the same slot. It would be the worst possible hash.
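The numbers in that example are easy to verify. The sketch below follows the answer's description and uses a literal modulo; as far as I know, java.util.HashMap actually mixes the hash bits and masks them against a power-of-two table size rather than taking a plain modulo, but the point about the second step is the same. The class name ModuloDemo is made up for the example.

```java
import java.util.HashSet;
import java.util.Set;

final class ModuloDemo {
    // Counts how many distinct slots the 100 hash codes 1, 1001, ..., 99001
    // occupy in a table of the given size.
    static int distinctSlots(int tableSize) {
        Set<Integer> slots = new HashSet<>();
        for (int hash = 1; hash <= 99_001; hash += 1_000) {
            slots.add(hash % tableSize);
        }
        return slots.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctSlots(100_000)); // 100: every key gets its own slot
        System.out.println(distinctSlots(1_000));   // 1: every key lands in slot 1
    }
}
```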

So consider constructing a good hash function. Take the extreme cases. Suppose that your key is a date. You know that the dates will all be in January of the same year. Then using the day of the month as the hash value should be as good as it's going to get: everything will hash to a unique integer in a small range. On the other hand, if your dates were all the first of the month, across many years and many months, taking the day of the month would be a terrible hash, as every actual key would map to "1".

My point being that if you really want to optimize your hash, you need to know the nature of your data. What is the actual range of values that you will get?