
How should I go about optimizing a hash table for a given population?

Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).

EDIT: I want to optimize lookup. I don't care how long it takes to build.

I would make sure that your keys hash to unique values. This will ensure that every lookup will be constant time, and thus, as fast as possible.

Since you will never have more than 100,000 keys, it is entirely possible to give each one a distinct hash value.

Also, make sure that you use the constructor that takes an int to specify the initial capacity (set it to 100,000) and a float to set the load factor (use 1.0). Doing this requires that you have a perfect hash function for your keys, but it will give you the fastest possible lookup in the least amount of memory.
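For example, a minimal sketch of that construction, using String keys and Integer values as placeholders for whatever the actual pairs are:

```java
import java.util.HashMap;
import java.util.Map;

public class FixedPopulationTable {
    public static void main(String[] args) {
        // Initial capacity 100,000 with load factor 1.0f: HashMap rounds the
        // capacity up to the next power of two (131,072), so the resize
        // threshold (capacity * loadFactor) already exceeds the population
        // and the table is never rehashed while it is being filled.
        Map<String, Integer> table = new HashMap<>(100_000, 1.0f);
        table.put("some-key", 42);
        System.out.println(table.get("some-key"));
    }
}
```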

In general, to optimize a hash table, you want to minimize collisions in the determination of your hash, so your buckets won't contain more than one item and the hash-search will return immediately.

Most of the time, that means you should measure the output of your hash function on the actual problem space, so I'd recommend looking into that.
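A sketch of one way to take that measurement: hash every key in the population, reduce each hash to a bucket index, and count the keys that share a bucket. The table size and sample keys here are placeholders:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashSpreadCheck {
    /** Counts how many keys share a bucket with at least one other key. */
    static <K> long countCollidingKeys(Collection<K> keys, int tableSize) {
        Map<Integer, Integer> bucketCounts = new HashMap<>();
        for (K key : keys) {
            int bucket = Math.floorMod(key.hashCode(), tableSize);
            bucketCounts.merge(bucket, 1, Integer::sum);
        }
        return bucketCounts.values().stream()
                .filter(count -> count > 1)
                .mapToLong(Integer::longValue)
                .sum();
    }

    public static void main(String[] args) {
        List<String> population = List.of("foo", "food", "january", "bar");
        System.out.println(countCollidingKeys(population, 1 << 17) + " colliding keys");
    }
}
```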

Ensure there are no collisions. If there are no collisions, you are guaranteed constant O(1) look-up time. The next thing to optimize would then be the look-up step itself.

Use a profiler to optimize piece by piece. It's hard to do without one.

If it's possible to make a hash table large enough that there are no collisions at all, that would be ideal, since your insertions and lookups will then be done in constant time.

But if that is not possible, try to choose a hash function such that your keys get distributed uniformly across the hash table.

Perfect hashing algorithms deal with the problem, but may not scale to 100k objects. I found a Java MPH package, but haven't tried it.

If the population is known at compile time, then the optimal solution is to use a minimal perfect hash function (MPH). The Wikipedia page on this subject links to several Java tools that can generate these.
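Those generators do the heavy lifting in practice. Purely to illustrate the idea (and since the question says build time is free), below is a brute-force sketch that tries successive multipliers until every key in the fixed population lands in its own slot. It only terminates quickly for small sets, which is exactly why the dedicated tools exist:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SeedSearchPerfectHash {
    private final int multiplier;
    private final int size;

    /** Build time is unlimited here, so simply try multipliers until
     *  every key lands in its own slot. Feasible only for small sets;
     *  real MPH generators scale to far larger populations. */
    SeedSearchPerfectHash(List<String> keys) {
        this.size = keys.size();
        int m = 0;
        search:
        while (true) {
            m++;
            Set<Integer> used = new HashSet<>();
            for (String key : keys) {
                if (!used.add(slot(key, m, size))) continue search;
            }
            break;
        }
        this.multiplier = m;
    }

    private static int slot(String key, int multiplier, int size) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = h * multiplier + key.charAt(i);
        }
        return Math.floorMod(h, size);
    }

    int slotOf(String key) {
        return slot(key, multiplier, size);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("foo", "food", "january", "bar");
        SeedSearchPerfectHash mph = new SeedSearchPerfectHash(keys);
        for (String k : keys) {
            System.out.println(k + " -> slot " + mph.slotOf(k));
        }
    }
}
```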

The optimization must be done in the hashCode method of the key class. The thing to keep in mind is to implement this method so that it avoids collisions.
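For instance, a hypothetical key class; the point is that hashCode combines every field that equals compares, so distinct keys are likely to get distinct hashes:

```java
import java.util.Objects;

// A hypothetical two-field key class. hashCode mixes both fields,
// so keys differing in either field tend to land in different buckets.
public final class GridKey {
    private final int x;
    private final int y;

    public GridKey(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof GridKey)) return false;
        GridKey other = (GridKey) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // 31-based combination of both fields
    }
}
```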

Getting a hash function to give totally unique values for 100K objects is likely to be close to impossible. Consider the birthday paradox: the date on which people are born can be considered a hashing algorithm, yet if you have more than 23 people you are more likely than not to have a collision, and that is in a table of 365 dates.

So how big a table would you need to have no collisions among 100K keys?
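A rough estimate, assuming the hash behaves like a uniform random function: the birthday approximation gives P(no collision) ≈ exp(-n(n-1)/(2m)) for n keys in m slots, so a 50% chance of zero collisions needs m ≈ n²/(2 ln 2):

```java
public class BirthdayEstimate {
    public static void main(String[] args) {
        // P(no collision) ~ exp(-n(n-1) / (2m)); set it to 0.5, solve for m.
        double n = 100_000;
        double m = n * (n - 1) / (2 * Math.log(2));
        System.out.printf("~%.2e slots for a 50%% chance of zero collisions%n", m);
        // Prints roughly 7.2e9 slots: around 72,000 slots per key.
    }
}
```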

If your keys are strings, your optimal strategy is a tree (a trie), not binary but n-branching at each character. If the keys are lower-case only, it is easier still, as you need just 26 children whenever you create a branch.

We start with 26 branches. Follow the first character, say f. The f node might have a value associated with it, and it may have sub-trees. Next look up the sub-tree for o; this leads to more sub-trees; then look up the next o. (You knew where that was leading!) If the final node doesn't have a value associated with it, or we hit a null sub-tree on the way, we know the value is not found.

You can optimise the space in the tree where you hit a point of uniqueness. Say you have a key january and it becomes unique at the 4th character. At this point, where you assign the value, you also store the actual string associated with it, so a look-up can verify the full key. In our example there may be one value associated with the foo path, but the key it relates to may be food, not foo.
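A minimal sketch of such a trie for lower-case keys, with that space optimisation: each key is parked, together with its full string, at its point of uniqueness, and is pushed one level further down only when a later key follows the same path:

```java
public class LowercaseTrie<V> {

    private static final class Node<V> {
        @SuppressWarnings("unchecked")
        final Node<V>[] children = new Node[26];
        String key;   // full key, parked at its point of uniqueness
        V value;
    }

    private final Node<V> root = new Node<>();

    /** Keys must be non-empty and lower-case a-z. */
    public void put(String key, V value) {
        Node<V> node = root;
        int depth = 0;
        while (true) {
            if (key.equals(node.key)) {           // same key: overwrite
                node.value = value;
                return;
            }
            if (node.key != null && node.key.length() > depth) {
                // A different key is parked here; push it one level down
                // (its slot is free by construction) so the keys diverge.
                int oc = node.key.charAt(depth) - 'a';
                Node<V> moved = new Node<>();
                moved.key = node.key;
                moved.value = node.value;
                node.children[oc] = moved;
                node.key = null;
                node.value = null;
            }
            if (key.length() == depth) {          // key ends exactly here
                node.key = key;
                node.value = value;
                return;
            }
            int c = key.charAt(depth) - 'a';
            if (node.children[c] == null) {
                // Point of uniqueness: park the whole key here instead
                // of building one node per remaining character.
                Node<V> leaf = new Node<>();
                leaf.key = key;
                leaf.value = value;
                node.children[c] = leaf;
                return;
            }
            node = node.children[c];
            depth++;
        }
    }

    public V get(String key) {
        Node<V> node = root;
        for (int depth = 0; ; depth++) {
            if (key.equals(node.key)) return node.value;
            if (depth == key.length()) return null;
            Node<V> next = node.children[key.charAt(depth) - 'a'];
            if (next == null) return null;        // null sub-tree: not found
            node = next;
        }
    }
}
```

The string comparison in get is what catches the foo versus food case described above.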

I think Google's search engine uses a technique similar to this.

The key question is what your key is. (No pun intended.) As others have pointed out, the goal is to minimize the number of hash collisions. If you can get the number of hash collisions to zero, i.e. your hash function generates a unique value for every key that is actually passed to it, you will have a perfect hash.

Note that in Java, a hash function really has two steps: First, the key is run through the hashCode function for its class. Then we calculate an index value into the hash table by taking this value modulo the size of the hash table.

I think that people discussing the perfect hash function tend to forget that second step. Even if you wrote a hashCode function that generated a unique value for every key passed to it, you could still get an absolutely terrible hash if this value modulo the hash table size is not unique. For example, say you have 100 keys and your hashCode function returns the values 1, 1001, 2001, 3001, 4001, 5001, ... 99001. If your hash table has 100,000 slots, this would be a perfect hash. Every key gets its own slot. But if it has 1000 slots, they all hash to the same slot. It would be the worst possible hash.
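The 100-key example written out below. (Note that java.util.HashMap actually spreads the high bits and masks against a power-of-two table size rather than taking a plain modulo, but the effect described here is the same.)

```java
import java.util.HashSet;
import java.util.Set;

public class ModuloStep {
    public static void main(String[] args) {
        // hashCode values 1, 1001, 2001, ..., 99001 from the example above.
        Set<Integer> slotsIn100000 = new HashSet<>();
        Set<Integer> slotsIn1000 = new HashSet<>();
        for (int h = 1; h <= 99_001; h += 1000) {
            slotsIn100000.add(h % 100_000);
            slotsIn1000.add(h % 1_000);
        }
        System.out.println(slotsIn100000.size()); // 100 distinct slots: a perfect hash
        System.out.println(slotsIn1000.size());   // 1 slot: every key collides
    }
}
```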

So consider constructing a good hash function. Take the extreme cases. Suppose that your key is a date. You know that the dates will all be in January of the same year. Then using the day of the month as the hash value should be as good as it's going to get: everything will hash to a unique integer in a small range. On the other hand, if your dates were all the first of the month for many years and many months, taking the day of the month would be a terrible hash, as every actual key would map to "1".
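As a sketch of those two extremes, using java.time and the day of the month as the hash value:

```java
import java.time.LocalDate;

public class DateHashExtremes {
    // Day of month as the hash: a perfect spread when every key falls in
    // the same January, the worst possible spread when every key is the
    // first of some month.
    static int dayOfMonthHash(LocalDate date) {
        return date.getDayOfMonth();
    }

    public static void main(String[] args) {
        System.out.println(dayOfMonthHash(LocalDate.of(2024, 1, 17))); // 17
        System.out.println(dayOfMonthHash(LocalDate.of(2024, 1, 31))); // 31
        System.out.println(dayOfMonthHash(LocalDate.of(2023, 5, 1)));  // 1
        System.out.println(dayOfMonthHash(LocalDate.of(2024, 9, 1)));  // 1
    }
}
```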

My point being that if you really want to optimize your hash, you need to know the nature of your data. What is the actual range of values that you will get?
