
Hash table: about the load factor

I'm studying hash tables for an algorithms class and I became confused by the load factor. Why is the load factor, n/m, significant, with 'n' being the number of elements and 'm' being the number of table slots? Also, why does this load factor equal the expected length n(j) of the linked list at slot j, when all of the elements that hash to a slot are stored in a single linked list there?

The crucial property of a hash table is the expected constant time it takes to look up an element.*

In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table returns within some fixed number of steps.

If you have a hash table with m buckets and you add elements indefinitely (i.e. n >> m), then the lists grow as well and you can't guarantee that expected constant time for lookups; instead you get linear time, since the time needed to traverse the ever-growing linked lists outweighs the cost of finding the bucket.

So, how can we ensure that the lists don't grow? Well, we have to make sure that the length of each list is bounded by some fixed constant. How do we do that? Well, we have to add additional buckets.

If the hash table is well implemented, then the hash function used to map the elements to buckets should distribute the elements evenly across the buckets. If the hash function does this, then the lengths of the lists will be roughly the same.

How long is each of the lists if the elements are distributed evenly? Clearly it's the total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).

Hence, to ensure constant-time lookup, what we have to do is keep track of the load factor (again: the expected length of the lists) so that, when it goes above the fixed constant, we can add additional buckets.

Of course, there are more problems that come up, such as how to redistribute the elements you have already stored, or how many buckets you should add.
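To make this concrete, here is a minimal Python sketch of a chained hash table that tracks its load factor and, once it exceeds a fixed threshold, doubles the number of buckets and rehashes the stored elements. The class name, the initial bucket count of 8, and the 0.75 threshold are my own illustrative choices, not anything prescribed above.

```python
class ChainedHashTable:
    """Toy hash table with separate chaining and load-factor-based resizing."""

    def __init__(self, num_buckets=8, max_load_factor=0.75):
        self.buckets = [[] for _ in range(num_buckets)]  # each bucket is a list of (key, value) pairs
        self.num_elements = 0
        self.max_load_factor = max_load_factor  # the "fixed constant" bound on n/m

    def _bucket_index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._bucket_index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.num_elements += 1
        # load factor n/m went above the threshold -> add buckets and redistribute
        if self.num_elements / len(self.buckets) > self.max_load_factor:
            self._resize(2 * len(self.buckets))

    def lookup(self, key):
        bucket = self.buckets[self._bucket_index(key)]
        for k, v in bucket:              # scans one short chain: expected O(1 + n/m)
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, new_num_buckets):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.buckets = [[] for _ in range(new_num_buckets)]
        for key, value in old_items:     # rehash every stored element into the new buckets
            self.buckets[self._bucket_index(key)].append((key, value))
```

Doubling is a common growth strategy because it keeps the amortized cost of the occasional full rehash constant per insertion, but any strategy that keeps n/m bounded would do.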

The important message to take away is that the load factor is needed to decide when to add additional buckets to the hash table; that's why it is not only 'important' but crucial.


Of course, if you map all the elements to the same bucket, then the average length of each list won't be worth much. All of this only makes sense if you distribute the elements evenly across the buckets.

*Note the expected; I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n) and you can't make that go away.
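To see that footnote in action, here is a small, deliberately broken sketch (Python; the constant hash function is a pathological choice made only for illustration): when every key maps to the same bucket, the table degenerates into one long chain and a lookup scans all n elements.

```python
def bad_hash(key):
    return 0  # pathological: every key lands in bucket 0

num_buckets = 16
buckets = [[] for _ in range(num_buckets)]

for key in range(1000):
    buckets[bad_hash(key) % num_buckets].append(key)

# All 1000 elements sit in one chain; every other bucket is empty.
print(len(buckets[0]))                    # 1000
print(sum(len(b) for b in buckets[1:]))   # 0

# A lookup now has to walk that single chain, so it degrades to O(n):
target = 999
steps = next(i for i, k in enumerate(buckets[bad_hash(target) % num_buckets], 1) if k == target)
print(steps)                              # 1000 comparisons to find the last element
```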

Adding to the existing answers, let me just put in a quick derivation.

Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the i-th element is inserted into this bucket and 0 otherwise.

We want to find E[X_1 + X_2 + ... + X_n].

By linearity of expectation, this equals E[X_1] + E[X_2] + ... + E[X_n].

Now we need to find the value of E[X_i]. By the definition of expected value, this is simply (1/m)·1 + (1 − 1/m)·0 = 1/m. Summing these values over all i, we get 1/m + 1/m + ... + 1/m, n times, which equals n/m. We have just found the expected number of elements inserted into an arbitrarily chosen bucket, and this is exactly the load factor.
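If you want to sanity-check this empirically, here is a quick simulation sketch in Python (the values of n, m, and the number of trials are arbitrary choices of mine): keys are thrown into m buckets uniformly at random, and the average length of an arbitrarily chosen bucket comes out close to n/m.

```python
import random

n, m = 1000, 64          # number of elements and buckets (arbitrary)
trials = 200
total_length = 0

for _ in range(trials):
    counts = [0] * m
    for _ in range(n):
        counts[random.randrange(m)] += 1          # element goes to a uniformly random bucket
    total_length += counts[random.randrange(m)]   # length of one arbitrarily chosen bucket

print(total_length / trials)  # close to n/m
print(n / m)                  # 15.625
```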
