How to choose size of hash table?
Suppose I have 200,000 words, and I am going to use
hash*33 + word[i]
as the hash function. What should the size of the table be, to optimize for minimal memory/paging issues?
Platform used - C (C99 version),
words are English words, ASCII values
One-time initialization of the hash table (buckets of linked-list style),
used for subsequent searches, like a dictionary lookup.
After a collision, the word will be added as a new node in the bucket.
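In sketch form, the setup I have in mind looks like this (the seed of 5381 is the usual djb2 starting value, and the table size is just a placeholder, not a decision):

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 266701  /* placeholder; picking this is the question */

struct node {
    char *word;
    struct node *next;
};

static struct node *table[TABLE_SIZE];

/* hash*33 + word[i], starting from the common djb2 seed of 5381 */
static unsigned long hash_word(const char *word)
{
    unsigned long hash = 5381;
    for (size_t i = 0; word[i] != '\0'; i++)
        hash = hash * 33 + (unsigned char)word[i];
    return hash;
}

/* after a collision, the word becomes a new node at the head of its bucket */
static void insert_word(const char *word)
{
    size_t idx = hash_word(word) % TABLE_SIZE;
    size_t len = strlen(word) + 1;
    struct node *n = malloc(sizeof *n);
    n->word = malloc(len);
    memcpy(n->word, word, len);
    n->next = table[idx];
    table[idx] = n;
}
```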
A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup, assuming you have a good hash function.
Based on that, you would want a minimum of about 266,700 buckets (for 75%), or 285,700 buckets for 70%. That's assuming no collisions.
That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
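One way to run that test, assuming your sample words are already loaded into an array (the function and variable names here are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* Tally collisions and the longest chain for a given table size.
 * Returns the number of colliding insertions. */
static size_t collision_stats(char **words, size_t nwords, size_t table_size)
{
    unsigned *counts = calloc(table_size, sizeof *counts);
    size_t used = 0, max_chain = 0;

    for (size_t i = 0; i < nwords; i++) {
        unsigned long h = 5381;                  /* djb2-style seed */
        for (const char *p = words[i]; *p; p++)
            h = h * 33 + (unsigned char)*p;
        size_t idx = h % table_size;
        if (counts[idx]++ == 0)
            used++;
        if (counts[idx] > max_chain)
            max_chain = counts[idx];
    }
    printf("size %zu: %zu collisions, longest chain %zu\n",
           table_size, nwords - used, max_chain);
    free(counts);
    return nwords - used;
}
```

Running it at, say, 250,000, then 500,000, then 1,000,000 buckets will show you how quickly the collision count tapers off for your actual data.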
You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
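For instance, Jenkins's one-at-a-time hash (one member of that family) is only a few lines of C:

```c
#include <stdint.h>

/* Bob Jenkins's one-at-a-time hash: mixes each byte more thoroughly
 * than hash*33 + c, giving a better spread for short ASCII strings */
static uint32_t one_at_a_time(const char *key)
{
    uint32_t hash = 0;
    while (*key) {
        hash += (unsigned char)*key++;
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```

The final three avalanche steps mean even one-character differences flip many output bits; the cost is a handful of extra shifts and adds per byte.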
You could also just throw memory at the problem. A table size of 500,000 gives you a load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain you won't actually realize that.
Long story short: use a better hash function and do some testing at different table sizes.
There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.