简体   繁体   English

为什么在 hashCode 中使用质数?

[英]Why use a prime number in hashCode?

I was just wondering why is that primes are used in a class's hashCode() method?我只是想知道为什么在类的hashCode()方法中使用素数? For example, when using Eclipse to generate my hashCode() method there is always the prime number 31 used:例如,当使用 Eclipse 生成我的hashCode()方法时,总是使用质数31

public int hashCode() {
     final int prime = 31;
     //...
}

References:参考资料:

Here is a good primer on Hashcode and article on how hashing works that I found (C# but the concepts are transferrable): Eric Lippert's Guidelines and rules for GetHashCode()这是我发现的关于 Hashcode 和散列如何工作的一篇很好的入门书(C#,但概念是可转移的): Eric Lippert 的 GetHashCode() 指南和规则

Prime numbers are chosen to best distribute data among hash buckets.选择质数以在散列桶之间最好地分配数据。 If the distribution of inputs is random and evenly spread, then the choice of the hash code/modulus does not matter.如果输入的分布是随机且均匀分布的,则哈希码/模数的选择无关紧要。 It only has an impact when there is a certain pattern to the inputs.只有当输入存在某种模式时,它才会产生影响。

This is often the case when dealing with memory locations.处理内存位置时经常会出现这种情况。 For example, all 32-bit integers are aligned to addresses divisible by 4. Check out the table below to visualize the effects of using a prime vs. non-prime modulus:例如,所有 32 位整数都与可被 4 整除的地址对齐。查看下表以可视化使用质数与非质数模数的效果:

Input       Modulo 8    Modulo 7
0           0           0
4           4           4
8           0           1
12          4           5
16          0           2
20          4           6
24          0           3
28          4           0

Notice the almost-perfect distribution when using a prime modulus vs. a non-prime modulus.请注意使用质数模数与非质数模数时几乎完美的分布。

However, although the above example is largely contrived, the general principle is that when dealing with a pattern of inputs , using a prime number modulus will yield the best distribution.然而,尽管上面的例子在很大程度上是人为设计的,但一般原则是,在处理输入模式时,使用质数模数将产生最佳分布。

Because you want the number you are multiplying by and the number of buckets you are inserting into to have orthogonal prime factorizations.因为您希望乘以的数字和插入的桶数具有正交素数分解。

Suppose there are 8 buckets to insert into.假设有 8 个桶要插入。 If the number you are using to multiply by is some multiple of 8, then the bucket inserted into will only be determined by the least significant entry (the one not multiplied at all).如果您用来乘以的数字是 8 的某个倍数,那么插入的存储桶将仅由最不重要的条目(根本没有乘以的条目)确定。 Similar entries will collide.类似的条目会发生冲突。 Not good for a hash function.不适合散列函数。

31 is a large enough prime that the number of buckets is unlikely to be divisible by it (and in fact, modern java HashMap implementations keep the number of buckets to a power of 2). 31 是一个足够大的素数,桶的数量不太可能被它整除(事实上,现代 java HashMap 实现将桶的数量保持为 2 的幂)。

For what it's worth, Effective Java 2nd Edition hand-waives around the mathematics issue and just say that the reason to choose 31 is:值得一提的是, Effective Java 2nd Edition放弃了数学问题,只是说选择 31 的原因是:

  • Because it's an odd prime, and it's "traditional" to use primes因为它是一个奇素数,而且使用素数是“传统的”
  • It's also one less than a power of two, which permits for bitwise optimization它也小于 2 的幂,这允许按位优化

Here's the full quote, from Item 9: Always override hashCode when you override equals :这是第 9 项中的完整引用:当您覆盖equals时,始终覆盖hashCode

The value 31 was chosen because it's an odd prime.选择值 31 是因为它是一个奇质数。 If it were even and multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting.如果它是偶数并且乘法溢出,信息就会丢失,因为乘以 2 相当于移位。 The advantage of using a prime is less clear, but it is traditional.使用素数的优点不太清楚,但它是传统的。

A nice property of 31 is that the multiplication can be replaced by a shift ( §15.19 ) and subtraction for better performance: 31 的一个很好的属性是乘法可以用移位(第15.19 节)和减法代替以获得更好的性能:

 31 * i == (i << 5) - i

Modern VMs do this sort of optimization automatically.现代虚拟机会自动进行这种优化。


While the recipe in this item yields reasonably good hash functions, it does not yield state-of-the-art hash functions, nor do Java platform libraries provide such hash functions as of release 1.6.虽然本项中的配方产生了相当好的散列函数,但它没有产生最先进的散列函数,Java 平台库也没有提供这样的散列函数(截至 1.6 版)。 Writing such hash functions is a research topic, best left to mathematicians and theoretical computer scientists.编写这样的散列函数是一个研究课题,最好留给数学家和理论计算机科学家。

Perhaps a later release of the platform will provide state-of-the-art hash functions for its classes and utility methods to allow average programmers to construct such hash functions.也许该平台的后续版本将为其类和实用程序方法提供最先进的散列函数,以允许普通程序员构建此类散列函数。 In the meantime, the techniques described in this item should be adequate for most applications.同时,本项中描述的技术应该适用于大多数应用程序。

Rather simplistically, it can be said that using a multiplier with numerous divisors will result in more hash collisions .更简单地说,可以说使用具有多个除数的乘数会导致更多的哈希冲突 Since for effective hashing we want to minimize the number of collisions, we try to use a multiplier that has fewer divisors.由于为了有效散列,我们希望最小化冲突的数量,我们尝试使用具有较少除数的乘法器。 A prime number by definition has exactly two distinct, positive divisors.根据定义,素数正好有两个不同的正除数。

Related questions相关问题

我听说选择了 31 以便编译器可以优化乘法左移 5 位然后减去该值。

Here's a citation a little closer to the source.这是一个更接近来源的引文

It boils down to:归结为:

  • 31 is prime, which reduces collisions 31 是素数,减少冲突
  • 31 produces a good distribution, with 31 产生良好的分布,具有
  • a reasonable tradeoff in speed速度的合理权衡

First you compute the hash value modulo 2^32 (the size of an int ), so you want something relatively prime to 2^32 (relatively prime means that there are no common divisors).首先你计算哈希值模 2^32(一个int的大小),所以你想要一些与 2^32 相对质数的东西(相对质数意味着没有公约数)。 Any odd number would do for that.任何奇数都可以。

Then for a given hash table the index is usually computed from the hash value modulo the size of the hash table, so you want something that is relatively prime to the size of the hash table.然后对于给定的哈希表,索引通常是根据哈希值以哈希表的大小为模计算的,因此您需要一些与哈希表的大小相对质数的东西。 Often the sizes of hash tables are chosen as prime numbers for that reason.出于这个原因,哈希表的大小通常被选为素数。 In the case of Java the Sun implementation makes sure that the size is always a power of two, so an odd number would suffice here, too.在 Java 的情况下,Sun 实现确保大小始终是 2 的幂,因此这里奇数也足够了。 There is also some additional massaging of the hash keys to limit collisions further.还有一些额外的散列键按摩以进一步限制冲突。

The bad effect if the hash table and the multiplier had a common factor n could be that in certain circumstances only 1/n entries in the hash table would be used.如果散列表和乘数有一个公因子n的坏影响可能是在某些情况下散列表中只有 1/n 个条目会被使用。

The reason why prime numbers are used is to minimize collisions when the data exhibits some particular patterns.使用质数的原因是在数据表现出某些特定模式时尽量减少冲突。

First things first: If the data is random then there's no need for a prime number, you can do a mod operation against any number and you will have the same number of collisions for each possible value of the modulus.首先要做的事情是:如果数据是随机的,则不需要质数,您可以对任何数字进行模运算,并且对于模数的每个可能值,您将有相同数量的碰撞。

But when data is not random then strange things happen.但是当数据不是随机的时,就会发生奇怪的事情。 For example consider numeric data that is always a multiple of 10.例如,考虑始终是 10 倍数的数字数据。

If we use mod 4 we find:如果我们使用 mod 4,我们会发现:

10 mod 4 = 2 10 模 4 = 2

20 mod 4 = 0 20 模 4 = 0

30 mod 4 = 2 30 模 4 = 2

40 mod 4 = 0 40 模 4 = 0

50 mod 4 = 2 50 模 4 = 2

So from the 3 possible values of the modulus (0,1,2,3) only 0 and 2 will have collisions, that is bad.因此,从模数 (0,1,2,3) 的 3 个可能值中,只有 0 和 2 会发生冲突,这很糟糕。

If we use a prime number like 7:如果我们使用像 7 这样的质数:

10 mod 7 = 3 10 模 7 = 3

20 mod 7 = 6 20 模 7 = 6

30 mod 7 = 2 30 模 7 = 2

40 mod 7 = 4 40 模 7 = 4

50 mod 7 = 1 50 模 7 = 1

etc

We also note that 5 is not a good choice but 5 is prime the reason is that all our keys are a multiple of 5. This means we have to choose a prime number that doesn't divide our keys, choosing a large prime number is usually enough.我们还注意到 5 不是一个好的选择,但 5 是质数,原因是我们所有的键都是 5 的倍数。这意味着我们必须选择一个不能整除我们的键的质数,选择一个大的质数是通常足够了。

So erring on the side of being repetitive the reason prime numbers are used is to neutralize the effect of patterns in the keys in the distribution of collisions of a hash function.因此,在重复方面犯了错误,使用质数的原因是为了抵消散列函数冲突分布中键中模式的影响。

31 is also specific to Java HashMap which uses a int as hash data type. 31 也特定于 Java HashMap,它使用 int 作为哈希数据类型。 Thus the max capacity of 2^32.因此最大容量为 2^32。 There is no point in using larger Fermat or Mersenne primes.使用更大的费马或梅森素数是没有意义的。

它通常有助于在散列桶之间实现更均匀的数据分布,尤其是对于低熵键。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM