
Hash table: about the load factor

Original post: 2015-10-15 12:02:15 | Tags: data-structures, hashtable, load-factor

I'm studying hash tables for an algorithms class and I've become confused about the load factor. Why is the load factor, n/m, significant, with 'n' being the number of elements and 'm' being the number of table slots? Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table, when all of the elements are stored in a single slot?

2 answers

The crucial property of a hash table is the expected constant time it takes to look up an element.*

In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table is expected to finish within some fixed number of steps.

If you have a hash table with m buckets and you keep adding elements indefinitely (i.e. n >> m), then the lists will also grow and you can no longer guarantee expected constant time for lookups. Instead you get linear time, since the time needed to traverse the ever-growing linked lists outweighs the time needed to find the bucket.

So, how can we keep the lists from growing? You have to make sure that the length of each list is bounded by some fixed constant. How do we do that? We have to add additional buckets.

If the hash table is well implemented, then the hash function used to map the elements to buckets should distribute the elements evenly across the buckets. If the hash function does this, then the lengths of the lists will be roughly the same.

How long is one of the lists if the elements are distributed evenly? Clearly it is the total number of elements divided by the number of buckets, i.e. the load factor n/m (number of elements per bucket = expected/average length of each list).
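To see this numerically, here is a minimal sketch that assumes an idealized hash function, modeled by picking a bucket uniformly at random for every element. The helper name `chain_lengths` and the values of n and m are made up for illustration.

```python
import random

def chain_lengths(n, m, seed=0):
    """Drop n elements into m buckets uniformly at random; return the list length per bucket."""
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(n):
        counts[rng.randrange(m)] += 1
    return counts

n, m = 10_000, 1_000
counts = chain_lengths(n, m)
load_factor = n / m
average_length = sum(counts) / len(counts)

# The average list length is n/m by construction; individual lists scatter around it.
print("load factor:   ", load_factor)
print("average length:", average_length)
print("shortest list: ", min(counts))
print("longest list:  ", max(counts))
```

The average over all buckets is n/m no matter how the elements are spread. What an even distribution buys you is that the individual lists stay close to that average.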

Hence, to ensure constant-time lookup, we have to keep track of the load factor (again: the expected length of the lists) so that, when it goes above the fixed constant, we can add additional buckets.

Of course, further problems come up, such as how to redistribute the elements you have already stored, or how many buckets you should add.
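One common way to handle both points is to double the number of buckets and rehash every stored element whenever the load factor crosses the threshold. The sketch below illustrates that idea and is not the only possible design. The class name `ChainedHashTable`, the threshold of 0.75 and the initial 8 buckets are arbitrary choices made for this example.

```python
class ChainedHashTable:
    """Separate-chaining hash table that doubles its bucket count when n/m exceeds max_load."""

    def __init__(self, buckets=8, max_load=0.75):
        self.buckets = [[] for _ in range(buckets)]  # m lists, one per bucket
        self.size = 0                                # n, the number of stored keys
        self.max_load = max_load                     # assumed threshold for n/m

    def load_factor(self):
        return self.size / len(self.buckets)

    def _chain(self, key):
        # Map the key to its bucket using Python's built-in hash().
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.size += 1
        if self.load_factor() > self.max_load:
            self._resize(2 * len(self.buckets))

    def get(self, key, default=None):
        for k, v in self._chain(key):    # walks a single list only
            if k == key:
                return v
        return default

    def _resize(self, new_bucket_count):
        # Redistribute: every stored pair is rehashed into the larger bucket array.
        old_items = [item for chain in self.buckets for item in chain]
        self.buckets = [[] for _ in range(new_bucket_count)]
        for k, v in old_items:
            self._chain(k).append((k, v))


table = ChainedHashTable()
for i in range(1000):
    table.put(i, i * i)
print(table.get(10), len(table.buckets), table.load_factor())
```

Doubling keeps the load factor below the threshold after every insertion, and because the table grows geometrically, the total rehashing work stays proportional to the number of insertions.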

The important message to take away is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.

Of course, if you map all the elements to the same bucket, then the average length of the lists won't tell you much. All of this only makes sense if you distribute the elements evenly across the buckets.
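A small sketch of that degenerate case, comparing a hash function that spreads the keys with one that sends every key to bucket 0. Both lambdas and the helper `bucket_lengths` are hypothetical and exist only for this illustration.

```python
def bucket_lengths(keys, m, hash_fn):
    """Count how many keys land in each of the m buckets under hash_fn."""
    lengths = [0] * m
    for key in keys:
        lengths[hash_fn(key) % m] += 1
    return lengths

keys = range(1000)
m = 100

even = bucket_lengths(keys, m, hash_fn=lambda k: k)        # spreads keys over all buckets
degenerate = bucket_lengths(keys, m, hash_fn=lambda k: 0)  # every key lands in bucket 0

# Both tables have the same load factor n/m, but very different longest lists.
print("load factor:", len(keys) / m)
print("even:       longest list =", max(even))
print("degenerate: longest list =", max(degenerate))
```

The load factor is identical in both cases, so on its own it says nothing about lookup cost. It is only a useful proxy for list length when the hash function distributes the keys evenly.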

*Note the word expected - I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n) and you can't make that go away.

Adding to the existing answers, let me just put in a quick derivation.

Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the i-th element is inserted into this bucket and 0 otherwise.

We want to find E[X_1 + X_2 + ... + X_n].

By linearity of expectation, this equals E[X_1] + E[X_2] + ... + E[X_n].

Now we need to find the value of E[X_i]. Assuming each element is equally likely to land in any of the m buckets, this is simply (1/m) * 1 + (1 - 1/m) * 0 = 1/m by the definition of expected value. Summing over all i, we get 1/m + 1/m + ... + 1/m (n times), which equals n/m. So the expected number of elements inserted into an arbitrary bucket is exactly the load factor.
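The result can also be checked empirically. The following sketch is a Monte Carlo estimate under the same assumption as the derivation, namely that each element independently lands in a uniformly random bucket. The function name `mean_bucket_count` and the parameter values are chosen only for illustration.

```python
import random

def mean_bucket_count(n, m, trials=2_000, seed=1):
    """Estimate E[X_1 + ... + X_n]: the expected number of elements landing in bucket 0."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # X_i = 1 when element i is hashed to the chosen bucket (bucket 0 here).
        total += sum(1 for _ in range(n) if rng.randrange(m) == 0)
    return total / trials

n, m = 50, 10
estimate = mean_bucket_count(n, m)
print("estimated expected list length:", estimate)
print("load factor n/m:               ", n / m)
```

With enough trials the estimate should settle close to n/m, matching the derivation.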