
How to choose size of hash table?

Suppose I have 200,000 words, and I am going to use hash*33 + word[i] as the hash function. What should the size of the table be, to optimize for minimal memory/paging issues?

Platform used: C (C99).

The words are English words, ASCII characters.

The hash table is initialized once (buckets in linked-list style).

It is then used for lookups, like a dictionary search.

After a collision, the word will be added as a new node in the bucket.
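For reference, the structure being described might look like this in C. This is a minimal sketch; TABLE_SIZE, the zero initial hash value, and the prepend-on-collision policy are assumptions filled in from the description above.

    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE 266700             /* assumed; see the discussion below */

    struct node {
        char *word;
        struct node *next;                /* next word in the same bucket */
    };

    static struct node *table[TABLE_SIZE];    /* one linked list per bucket */

    /* The hash*33 + word[i] function from the question. */
    static size_t hash33(const char *s) {
        unsigned long h = 0;
        for (; *s; s++)
            h = h * 33 + (unsigned char)*s;
        return h % TABLE_SIZE;
    }

    /* On a collision the new word is simply prepended to its bucket's list. */
    static void insert(const char *word) {
        struct node *n = malloc(sizeof *n);
        if (!n) return;
        n->word = malloc(strlen(word) + 1);
        if (!n->word) { free(n); return; }
        strcpy(n->word, word);
        size_t b = hash33(word);
        n->next = table[b];
        table[b] = n;
    }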

A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup, assuming you have a good hash function.

Based on that, you would want a minimum of about 266,700 buckets for 75% (200,000 / 0.75 ≈ 266,667), or about 285,700 buckets for 70% (200,000 / 0.70 ≈ 285,714). That's assuming no collisions.

That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
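A quick harness for that kind of test might look like the following sketch; the word list and the candidate sizes here are placeholders to swap for your real sample data.

    #include <stdio.h>
    #include <stdlib.h>

    static size_t hash33(const char *s, size_t table_size) {
        unsigned long h = 0;
        for (; *s; s++)
            h = h * 33 + (unsigned char)*s;
        return h % table_size;
    }

    int main(void) {
        /* Placeholder sample; load your 200,000 words here instead. */
        static const char *words[] = { "apple", "banana", "cherry", "date" };
        size_t n = sizeof words / sizeof words[0];
        size_t sizes[] = { 266700, 285700, 500000 };   /* candidate table sizes */

        for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
            size_t *chain = calloc(sizes[s], sizeof *chain);
            if (!chain) return 1;
            size_t collisions = 0, longest = 0;
            for (size_t i = 0; i < n; i++) {
                size_t b = hash33(words[i], sizes[s]);
                if (chain[b]++ > 0)          /* bucket already had an entry */
                    collisions++;
                if (chain[b] > longest)
                    longest = chain[b];
            }
            printf("size=%zu  collisions=%zu  longest chain=%zu\n",
                   sizes[s], collisions, longest);
            free(chain);
        }
        return 0;
    }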

You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
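For example, the one-at-a-time variant of the Jenkins hash is only a few lines of C:

    #include <stdint.h>
    #include <stddef.h>

    /* Jenkins one-at-a-time hash; reduce the result modulo
       the table size to get a bucket index. */
    uint32_t jenkins_one_at_a_time(const unsigned char *key, size_t len) {
        uint32_t hash = 0;
        for (size_t i = 0; i < len; i++) {
            hash += key[i];
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 11;
        hash += hash << 15;
        return hash;
    }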

You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.

Long story short: use a better hash function and do some testing at different table sizes.

There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.
