
How to choose size of hash table?

Suppose I have 200,000 words, and I am going to use hash*33 + word[i] as the hash function. What should the size of the table be, to optimize for minimal memory/paging issues?

Platform used: C (C99).

The words are English words, ASCII characters.

The hash table is initialized once (buckets in linked-list style).

It is then used for lookups, like a dictionary search.

After a collision, the word will be added as a new node in the bucket.
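For reference, the structure being described might look like this in C. This is a minimal sketch; TABLE_SIZE, the zero initial hash value, and the prepend-on-collision policy are assumptions filled in from the description above.

    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE 266700             /* assumed; see the discussion below */

    struct node {
        char *word;
        struct node *next;                /* next word in the same bucket */
    };

    static struct node *table[TABLE_SIZE];    /* one linked list per bucket */

    /* The hash*33 + word[i] function from the question. */
    static size_t hash33(const char *s) {
        unsigned long h = 0;
        for (; *s; s++)
            h = h * 33 + (unsigned char)*s;
        return h % TABLE_SIZE;
    }

    /* On a collision the new word is simply prepended to its bucket's list. */
    static void insert(const char *word) {
        struct node *n = malloc(sizeof *n);
        if (!n) return;
        n->word = malloc(strlen(word) + 1);
        if (!n->word) { free(n); return; }
        strcpy(n->word, word);
        size_t b = hash33(word);
        n->next = table[b];
        table[b] = n;
    }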

A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup, assuming you have a good hash function.

Based on that, you would want a minimum of about 266,700 buckets for 75% (200,000 / 0.75 ≈ 266,667), or about 285,700 buckets for 70% (200,000 / 0.70 ≈ 285,714). That's assuming no collisions.

That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
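A quick harness for that kind of test might look like the following sketch; the word list and the candidate sizes here are placeholders to swap for your real sample data.

    #include <stdio.h>
    #include <stdlib.h>

    static size_t hash33(const char *s, size_t table_size) {
        unsigned long h = 0;
        for (; *s; s++)
            h = h * 33 + (unsigned char)*s;
        return h % table_size;
    }

    int main(void) {
        /* Placeholder sample; load your 200,000 words here instead. */
        static const char *words[] = { "apple", "banana", "cherry", "date" };
        size_t n = sizeof words / sizeof words[0];
        size_t sizes[] = { 266700, 285700, 500000 };   /* candidate table sizes */

        for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++) {
            size_t *chain = calloc(sizes[s], sizeof *chain);
            if (!chain) return 1;
            size_t collisions = 0, longest = 0;
            for (size_t i = 0; i < n; i++) {
                size_t b = hash33(words[i], sizes[s]);
                if (chain[b]++ > 0)          /* bucket already had an entry */
                    collisions++;
                if (chain[b] > longest)
                    longest = chain[b];
            }
            printf("size=%zu  collisions=%zu  longest chain=%zu\n",
                   sizes[s], collisions, longest);
            free(chain);
        }
        return 0;
    }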

You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
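For example, the one-at-a-time variant of the Jenkins hash is only a few lines of C:

    #include <stdint.h>
    #include <stddef.h>

    /* Jenkins one-at-a-time hash; reduce the result modulo
       the table size to get a bucket index. */
    uint32_t jenkins_one_at_a_time(const unsigned char *key, size_t len) {
        uint32_t hash = 0;
        for (size_t i = 0; i < len; i++) {
            hash += key[i];
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 11;
        hash += hash << 15;
        return hash;
    }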

You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.

Long story short: use a better hash function and do some testing at different table sizes.

There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.
