简体   繁体   中英

How to choose size of hash table?

Suppose I have 200.000 of words, and I am going to use hash*33 + word[i] as a hash function, what should be the size of table for optimization, for minimum memory/paging issue?

Platform used - C (c99 version),

words are English char words, ASCII values

One time initialization of hash table (buckets of link list style),

used for searching next, like dictionary search.

After collision , that word will be added as new node into bucket.

A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup. Assuming you have a good hash function.

Based on that, you would want a minimum of about 266,700 buckets (for 75%), or 285,700 buckets for 70%. That's assuming no collisions.

That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.

You might also consider a better hash function than hash*33 + word[i] . The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.

You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.

Long story short: use a better hash function and do some testing at different table sizes.

There is such a thing as a minimal perfect hash . If you know what your input data is (ie, it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM