
Simple hashcode in hashmap misconception?

I am implementing my own specialized hash map with generic value types, but the keys are always of type long. Here and there, I see people suggesting that I should multiply the key by a prime and then take the result modulo the number of buckets:

int bucket = (key * prime) % numOfBuckets;

and I don't understand why. It seems to me that this has exactly the same distribution as the simple version:

int bucket = key % numOfBuckets;

For example, if numOfBuckets is 8, the second "algorithm" yields the buckets {0, 1, 2, 3, 4, 5, 6, 7}, repeating, for key = 0 to infinity. With the first algorithm, the same keys yield the buckets {0, 3, 6, 1, 4, 7, 2, 5} (or similar), also repeating. So we have the same problem as when using the identity hash.

In both cases, we get collisions for keys of the form:

key = x + k*numOfBuckets (for k = 1 to infinity; and x = key % numOfBuckets)

because when we take the modulo by numOfBuckets we always get x. So what is the point of the first algorithm? Can someone enlighten me?

If numOfBuckets is a power of two and the prime is odd (which seems to be the intended use case), then gcd(numOfBuckets, prime) == 1. That in turn means there is a number inverse such that inverse * prime = 1 (mod numOfBuckets), so the multiplication is a bijection on the bucket indices that just shuffles the buckets around a bit. That is of course useless, so your conclusions are correct.
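A minimal sketch of that argument, assuming 8 buckets and the odd multiplier 3 (both chosen only for illustration; the class and method names are hypothetical). It checks that the multiplication merely permutes the bucket indices and that a modular inverse of the multiplier exists:

```java
import java.util.Arrays;

public class OddMultiplierDemo {
    static final int NUM_BUCKETS = 8;   // power of two (assumed for illustration)
    static final long MULTIPLIER = 3;   // odd, so gcd(8, 3) == 1

    // Bucket index with the multiplication step.
    static int bucketWithMultiply(long key) {
        // floorMod keeps the result in [0, NUM_BUCKETS) even for negative keys
        return (int) Math.floorMod(key * MULTIPLIER, (long) NUM_BUCKETS);
    }

    // Brute-force search for inverse with inverse * MULTIPLIER == 1 (mod NUM_BUCKETS).
    static long inverseOfMultiplier() {
        for (long candidate = 1; candidate < NUM_BUCKETS; candidate++) {
            if ((candidate * MULTIPLIER) % NUM_BUCKETS == 1) {
                return candidate;
            }
        }
        throw new IllegalStateException("multiplier shares a factor with NUM_BUCKETS");
    }

    public static void main(String[] args) {
        int[] buckets = new int[NUM_BUCKETS];
        for (int key = 0; key < NUM_BUCKETS; key++) {
            buckets[key] = bucketWithMultiply(key);
        }
        System.out.println(Arrays.toString(buckets));   // [0, 3, 6, 1, 4, 7, 2, 5]
        System.out.println(inverseOfMultiplier());      // 3, because 3 * 3 = 9 = 1 (mod 8)
    }
}
```

Every bucket index appears exactly once in the printed array, so keys that collided before the multiplication still collide afterwards, just in a different bucket.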

Or perhaps more intuitively: in a multiplication, information only flows from the lowest bit to the highest, never in reverse. So any bits that the bucket index would not depend on without the multiplication are still discarded with the multiplication.
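To see the bit-flow point concretely, here is a small sketch, assuming 8 buckets and the odd prime 31 as the multiplier (illustrative values and hypothetical names). Two keys that differ only in a high bit land in the same bucket whether or not the multiplication is applied:

```java
public class LowBitsDemo {
    static final int NUM_BUCKETS = 8;    // power of two: only the low 3 bits select the bucket
    static final long MULTIPLIER = 31;   // odd prime, chosen only for illustration

    static int plainBucket(long key) {
        // Masking with NUM_BUCKETS - 1 is the same as key % 8 for non-negative keys.
        return (int) (key & (NUM_BUCKETS - 1));
    }

    static int multipliedBucket(long key) {
        // The low 3 bits of the product depend only on the low 3 bits of the key.
        return (int) ((key * MULTIPLIER) & (NUM_BUCKETS - 1));
    }

    public static void main(String[] args) {
        long a = 5;
        long b = 5 + (1L << 40);   // differs from a only in bit 40
        System.out.println(plainBucket(a) == plainBucket(b));           // true
        System.out.println(multipliedBucket(a) == multipliedBucket(b)); // true
    }
}
```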

Some other techniques do help. For example, older versions of Java's HashMap (such as Java 6) use this:

/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions.  This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
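As a rough illustration of why this helps, the sketch below copies the function above and feeds it hash codes that differ only above bit 15. The table length of 16, the test values, and the helper name tableIndex are assumptions made for the example:

```java
public class SupplementalHashDemo {
    // Copied from the snippet above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Hypothetical helper: index into a power-of-two table, keeping only the low bits.
    static int tableIndex(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int i = 0; i < 8; i++) {
            int hashCode = i << 16;   // the low 16 bits are all zero
            int without = tableIndex(hashCode, length);
            int with = tableIndex(hash(hashCode), length);
            System.out.println(hashCode + ": " + without + " -> " + with);
        }
        // Without hash(), all eight values land in bucket 0;
        // with hash(), they land in buckets 0 through 7.
    }
}
```

The shifts fold the high bits down into the low bits, which is exactly the direction a multiplication followed by a modulo cannot move information.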

Another thing that works is multiplying by some large constant and then using the upper bits of the result (which contain a mixture of the bits below them, so all bits of the key can be used that way).
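A minimal sketch of the multiply-and-take-upper-bits idea for long keys. The constant used here (2^64 divided by the golden ratio, rounded to an odd number) is one commonly used choice and is an assumption, as are the class and method names and the 8-bucket table:

```java
public class UpperBitsDemo {
    static final int BUCKET_BITS = 3;                    // 2^3 = 8 buckets (assumed)
    static final long MULTIPLIER = 0x9E3779B97F4A7C15L;  // large odd constant (assumed choice)

    static int bucket(long key) {
        // The product wraps modulo 2^64; the top BUCKET_BITS bits become the index.
        return (int) ((key * MULTIPLIER) >>> (64 - BUCKET_BITS));
    }

    public static void main(String[] args) {
        // Multiples of 8 all collide under key % 8, but not under this scheme.
        for (long key = 0; key < 64; key += 8) {
            System.out.println(key + " -> " + bucket(key));
        }
    }
}
```

Because the index is taken from the top of the product, every bit of the key can influence it, unlike the low bits used by the modulo approach.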
