Why does using powers of 2 as the hash size make a hash table considerably worse than using primes?
I'm implementing a hash table that is supposed to store pairs of 32-bit values. Since my elements have a fixed size, I'm using a very simple hashing function:
hash(a,b) = asUint64(a) + (asUint64(b) << 32)
With that, I get the index of an element in the hash table (that is, its corresponding bucket) with:
index(a,b) = hash(a,b) % hash_size
Where hash_size is the number of entries/buckets in my table. I've realized, though, that I could speed up this implementation a little if I replaced the modulus operator with a bitwise mod (a mask with hash_size - 1), by fixing hash_size as a power of 2. Except, when I do that, most of my pairs end up in the first bucket! Why is that happening?
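For reference, here is a minimal Python sketch of the two indexing variants described in the question. The function names (hash_pair, index_mod, index_mask) are made up for illustration, and the 64-bit wraparound is emulated with an explicit mask because Python integers are unbounded.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic


def hash_pair(a: int, b: int) -> int:
    """hash(a,b) = asUint64(a) + (asUint64(b) << 32), as in the question."""
    return ((a & MASK32) + ((b & MASK32) << 32)) & MASK64


def index_mod(a: int, b: int, hash_size: int) -> int:
    """Bucket index using the general modulus operator."""
    return hash_pair(a, b) % hash_size


def index_mask(a: int, b: int, hash_size: int) -> int:
    """Bucket index using a bit mask; only valid for a power-of-two size."""
    if hash_size <= 0 or hash_size & (hash_size - 1) != 0:
        raise ValueError("hash_size must be a power of two")
    return hash_pair(a, b) & (hash_size - 1)
```

For a power-of-two hash_size both functions return the same bucket, so the mask is a faithful replacement for the modulus. The change in behaviour comes from switching the table size itself, not from the mask.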
My guess is that your data is not evenly distributed in a. Consider the concatenation of a and b as your hash code:
b31b30...b1b0a31a30...a1a0, where ai and bi are the i-th bits of a and b
Suppose you have a table with about a million entries (2^20 buckets). Your hash index is then
a19a18...a1a0 (as an integer)
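A small Python check of this point, assuming a table of 2^20 buckets (roughly a million). The helper name bucket and the sample value of a are arbitrary choices for illustration.

```python
def bucket(a: int, b: int, hash_size: int) -> int:
    """Index of the pair (a, b) in a power-of-two sized table."""
    h = (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)
    return h & (hash_size - 1)


HASH_SIZE = 1 << 20  # about a million buckets
a = 0x12345678       # arbitrary sample value

# Vary b over many values while keeping a fixed.
buckets_seen = {bucket(a, b, HASH_SIZE) for b in range(1000)}

# Every pair lands in the bucket given by the low 20 bits of a.
low_bits_of_a = a & (HASH_SIZE - 1)
```

Here buckets_seen contains a single value, equal to low_bits_of_a: b never reaches the index because its bits all sit above bit 31.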
Worse, suppose a only ever ranges from 1 to 100. Then only the lowest 7 bits of a ever vary, so you have even less dependence on the higher-order bits of a, and at most 100 buckets are ever used.
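To put numbers on that scenario, the following sketch counts bucket usage for a in 1..100 and 1000 different values of b. The data set is invented purely for illustration and the bucket helper is a hypothetical name.

```python
from collections import Counter


def bucket(a: int, b: int, hash_size: int) -> int:
    """Index of the pair (a, b) in a power-of-two sized table."""
    h = (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)
    return h & (hash_size - 1)


HASH_SIZE = 1 << 20

# 100,000 distinct pairs: a in 1..100, b in 0..999 (made-up data).
pairs = [(a, b) for a in range(1, 101) for b in range(1000)]

counts = Counter(bucket(a, b, HASH_SIZE) for a, b in pairs)
used_buckets = len(counts)        # how many buckets hold anything
max_chain = max(counts.values())  # length of the longest chain
```

Out of roughly a million buckets only 100 are occupied, each holding 1000 pairs, so lookups degrade to scanning long chains.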
As you can see, if your hash table doesn't have at least 4 billion entries (2^32), your index has no dependence on b at all, and hash(a, x) will land in the same bucket for all x.
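The comparison with primes in the title can be illustrated the same way. The sketch below uses 1021 (a prime) and 1024 (a power of two) as example table sizes; these sizes and the names index, by_prime and by_pow2 are assumptions chosen for the demonstration. Because 2^32 is not divisible by 1021, the high bits holding b still influence the remainder.

```python
def index(a: int, b: int, hash_size: int) -> int:
    """index(a,b) = hash(a,b) % hash_size, with the hash from the question."""
    h = (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)
    return h % hash_size


PRIME_SIZE = 1021  # example prime table size
POW2_SIZE = 1024   # example power-of-two table size

a = 7  # keep a fixed so that only b varies
by_prime = {index(a, b, PRIME_SIZE) for b in range(PRIME_SIZE)}
by_pow2 = {index(a, b, POW2_SIZE) for b in range(PRIME_SIZE)}

print(len(by_prime), len(by_pow2))
```

With the prime size the 1021 pairs spread over all 1021 buckets, while with the power-of-two size they all share bucket 7. A prime modulus is one way to let b matter; mixing the bits of b into the low bits before masking would be another, though that goes beyond what the answer above discusses.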