
Why does using a power of 2 as the hash size make a hash table considerably worse than using a prime?

I'm implementing a hash table that is supposed to store pairs of 32-bit values. Since my elements have a fixed size, I'm using a very simple hash function:

hash(a,b) = asUint64(a) + (asUint64(b) << 32)

With that, I get the index of an element in the hash table (that is, its corresponding bucket) with:

index(a,b) = hash(a,b) % hash_size

Where hash_size is the number of entries/buckets in my table. I've realized, though, that I could speed up this implementation a little if I fixed hash_size as a power of 2 and replaced the modulus operator with a bitwise AND mask. Except, when I do that, most of my pairs end up in the first bucket! Why is that happening?
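For reference, the two index computations described above can be sketched as follows. This is only an illustration: the function names `pair_hash`, `index_mod`, and `index_mask` are made up here, and the explicit masking to 32 bits is an assumption about what asUint64 does with its inputs.

```python
def pair_hash(a: int, b: int) -> int:
    """hash(a, b) = a + (b << 32) for 32-bit unsigned a and b."""
    return (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)


def index_mod(a: int, b: int, hash_size: int) -> int:
    """Bucket index using the modulus operator (works for any size)."""
    return pair_hash(a, b) % hash_size


def index_mask(a: int, b: int, hash_size: int) -> int:
    """Bucket index using a bitwise AND; only valid for power-of-2 sizes."""
    if hash_size <= 0 or hash_size & (hash_size - 1) != 0:
        raise ValueError("hash_size must be a power of 2")
    return pair_hash(a, b) & (hash_size - 1)
```

For a power-of-2 hash_size both functions return the same index, so the mask is faster but no better at spreading keys than the modulus it replaces.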

My guess is that your data is not evenly distributed in a. Consider the concatenation of a and b as your hash code:

b31b30...b1b0a31a30...a1a0, where ai and bi are the i-th bits of a and b

Suppose you have a power-of-2 table with about a million entries (2^20). Your hash index is then just the low 20 bits of a:

a19a18...a1a0 (as an integer)
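A quick way to check this is to fix a and vary b. The sketch below assumes a table of 2^20 buckets and the hash from the question; `pair_hash`, `index`, and `HASH_SIZE` are illustrative names, not part of the original code.

```python
HASH_SIZE = 1 << 20  # about a million buckets, a power of 2


def pair_hash(a: int, b: int) -> int:
    """hash(a, b) = a + (b << 32) for 32-bit unsigned a and b."""
    return (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)


def index(a: int, b: int, hash_size: int = HASH_SIZE) -> int:
    return pair_hash(a, b) % hash_size


a = 0x12345678
# Collect the bucket index for several very different values of b.
indices = {index(a, b) for b in (0, 1, 7, 12345, 0xFFFFFFFF)}

# Every b lands in the same bucket: the low 20 bits of a.
print(indices == {a & (HASH_SIZE - 1)})  # True
```

Because b is shifted left by 32 bits, every bit of b sits above the 20 bits that survive the reduction, so b cannot influence the index.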

Worse, suppose a only ever ranges from 1 to 100. Then only its low 7 bits ever vary, so the index has even less dependence on the higher-order bits of a, and at most 100 buckets are ever used.

As you can see, if your power-of-2 hash table has no more than 2^32 (about 4 billion) entries, the index has no dependence on b at all, and hash(a, x) will land in the same bucket for every x.
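To see how this differs from a prime table size, here is a rough comparison that counts how many distinct buckets a set of pairs occupies. The data set (a from 1 to 100, b from 0 to 999), the prime 1000003, and the helper `buckets_used` are all assumptions chosen for illustration.

```python
def pair_hash(a: int, b: int) -> int:
    """hash(a, b) = a + (b << 32) for 32-bit unsigned a and b."""
    return (a & 0xFFFFFFFF) + ((b & 0xFFFFFFFF) << 32)


def buckets_used(pairs, hash_size: int) -> int:
    """Number of distinct bucket indices occupied by the given pairs."""
    return len({pair_hash(a, b) % hash_size for a, b in pairs})


# 100,000 distinct pairs: a only ranges over 1..100, b over 0..999.
pairs = [(a, b) for a in range(1, 101) for b in range(1000)]

POWER_OF_TWO_SIZE = 1 << 20   # 1,048,576 buckets
PRIME_SIZE = 1_000_003        # a prime of similar magnitude

used_pow2 = buckets_used(pairs, POWER_OF_TWO_SIZE)
used_prime = buckets_used(pairs, PRIME_SIZE)

print(used_pow2)   # 100: one bucket per distinct value of a
print(used_prime)  # far more buckets, since b now affects the index
```

With a prime modulus, 2^32 is not congruent to 0, so the high bits contributed by b still change the remainder. That is why the prime-sized table spreads the same keys over many more buckets.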


 