Why does using a power of 2 as the hash size make a hash table so much worse than using a prime?

Question

I'm implementing a hash table that is supposed to store pairs of 32-bit values. Since my elements have a fixed size, I'm using a very simple hash function:

hash(a,b) = asUint64(a) + (asUint64(b) << 32)

With that, I get the index of an element in the hash table (that is, its corresponding bucket) with:

index(a,b) = hash(a,b) % hash_size

Where hash_size is the number of entries/buckets in my table. I've realized, though, that I could speed up this implementation a little if I fixed hash_size as a power of 2 and replaced the modulus operator with a bitwise AND mask. Except, when I do that, most of my pairs end up in the first bucket! Why is that happening?
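For reference, the replacement described here relies on the identity h % 2^k == h & (2^k - 1) for non-negative h. The following is a minimal Python sketch of both index computations; the names pair_hash, index_mod and index_mask are illustrative and do not come from the original post.

```python
def pair_hash(a: int, b: int) -> int:
    """hash(a,b) from the question: a in the low 32 bits, b in the high 32 bits."""
    return (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)


def index_mod(a: int, b: int, hash_size: int) -> int:
    """Bucket index using the modulus operator (works for any hash_size > 0)."""
    return pair_hash(a, b) % hash_size


def index_mask(a: int, b: int, hash_size: int) -> int:
    """Bucket index using a bit mask; only valid when hash_size is a power of two."""
    if hash_size <= 0 or hash_size & (hash_size - 1) != 0:
        raise ValueError("hash_size must be a power of two")
    return pair_hash(a, b) & (hash_size - 1)


# Both computations agree whenever hash_size is a power of two.
same = index_mod(12345, 678, 1024) == index_mask(12345, 678, 1024)
```

The two functions always return the same bucket for a power-of-two size, so the mask itself is not the source of the clustering; the choice of hash_size is.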

Answer 1

My guess is that your data is not evenly distributed in a. Consider the concatenation of a and b as your hash code:

b31b30...b1b0a31a30...a1a0, where ai and bi are the i-th bits of a and b

Suppose you have a table with about a million (2^20) entries. Your hash index is then just the low 20 bits of the hash code:

a19a18...a1a0 (as an integer)
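To see the truncation concretely, here is a small Python sketch. It assumes a table of 2^20 buckets (roughly a million); the names pair_hash and index, and the sample values, are chosen only for illustration.

```python
HASH_SIZE = 1 << 20  # assumed table size: 2**20 buckets, about a million


def pair_hash(a: int, b: int) -> int:
    """hash(a,b) from the question: a in the low 32 bits, b in the high 32 bits."""
    return (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)


def index(a: int, b: int, hash_size: int = HASH_SIZE) -> int:
    """Bucket index as defined in the question."""
    return pair_hash(a, b) % hash_size


a = 0x12345678
# Vary b widely while holding a fixed and collect the resulting bucket indices.
indices = {index(a, b) for b in (0, 1, 42, 0xFFFFFFFF)}

# Every pair lands in the bucket given by the low 20 bits of a.
low_bits_of_a = a & (HASH_SIZE - 1)
```

Here indices contains a single value, equal to low_bits_of_a, no matter which b is used.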

Worse, suppose a only ever ranges from 1 to 100. Then only the low 7 bits of a ever vary, the higher-order bits of the index are always zero, and at most 100 buckets are ever used.

As you can see, unless your hash table has more than 2^32 (about 4 billion) entries, the index has no dependence on b at all, and the pairs (a, x) will collide in the same bucket for all x.
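The effect can be checked by counting how many distinct buckets a skewed data set touches. The sketch below is only an illustration under assumed inputs: a ranges over 1..100, b over 0..999, and the table sizes 1024 (a power of 2) and 1021 (a nearby prime) are chosen to keep the run short. The function name buckets_used is hypothetical.

```python
def pair_hash(a: int, b: int) -> int:
    """hash(a,b) from the question: a in the low 32 bits, b in the high 32 bits."""
    return (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)


def buckets_used(hash_size: int) -> int:
    """Count distinct buckets hit by all pairs with a in 1..100 and b in 0..999."""
    used = set()
    for a in range(1, 101):
        for b in range(1000):
            used.add(pair_hash(a, b) % hash_size)
    return len(used)


power_of_two_buckets = buckets_used(1024)  # b is discarded: index == a % 1024
prime_buckets = buckets_used(1021)         # 2**32 % 1021 != 0, so b contributes
```

With the power-of-2 size, the 100,000 pairs occupy only 100 buckets, one per value of a. With the prime size, b also affects the index and all 1021 buckets are used.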

Answer 1 (score 2, accepted), 2014-10-17 21:37:09
