
Why do the following three strings have the same hash code?

After reading the JDK source code, I am still surprised that the strings "AaAa", "AaBB" and "BBBB" have the same hash code.

The relevant JDK source is as follows:

int h = hash;
if (h == 0 && value.length > 0) {
    char val[] = value;

    for (int i = 0; i < value.length; i++) {
        h = 31 * h + val[i];
    }
    hash = h;
}
return h;

Could anyone clarify this?

Because that is how the hash code of a String is defined to be calculated:

The hash code for a String object is computed as

 s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] 

So:

  • For AaAa: 65*31^3 + 97*31^2 + 65*31 + 97 = 2031744
  • For AaBB: 65*31^3 + 97*31^2 + 66*31 + 66 = 2031744
  • For BBBB: 66*31^3 + 66*31^2 + 66*31 + 66 = 2031744
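The arithmetic above can be checked with a few lines of Java. This is a minimal sketch: the class name PolyHashDemo and the helper polyHash are made up for illustration and simply re-implement the documented formula, so that the result can be compared with what String.hashCode() returns.

```java
class PolyHashDemo {
    // Hypothetical helper: evaluates s[0]*31^(n-1) + ... + s[n-1]
    // with Horner's rule, the same loop shape as the JDK snippet.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow wraps silently
        }
        return h;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AaAa", "AaBB", "BBBB"}) {
            System.out.println(s + " -> " + polyHash(s)
                    + " (String.hashCode: " + s.hashCode() + ")");
        }
    }
}
```

All three lines should report the same value for both the hand-written formula and the built-in method.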

Because of probability.

There are about 4.3 billion possible hash codes (every int from Integer.MIN_VALUE to Integer.MAX_VALUE) and an effectively unlimited number of possible Strings, so there are bound to be collisions. In fact, the birthday problem shows that only about 77,000 strings are needed for roughly a 50% chance of at least one collision, and that would be if the hash function had extremely high entropy, which it doesn't.
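The 77,000 figure can be reproduced from the usual birthday-problem approximation, n ≈ sqrt(2 · N · ln(1 / (1 - p))), where N is the number of possible values and p the desired collision probability. The sketch below assumes an ideal, uniformly distributed hash; the class and method names are hypothetical.

```java
class BirthdayBound {
    // Approximate number of uniformly random draws from `buckets` possible
    // values needed to see at least one collision with probability p.
    static double samplesForCollision(double buckets, double p) {
        return Math.sqrt(2.0 * buckets * Math.log(1.0 / (1.0 - p)));
    }

    public static void main(String[] args) {
        double intRange = Math.pow(2, 32); // number of distinct int values
        System.out.printf("%.0f strings for a 50%% collision chance%n",
                samplesForCollision(intRange, 0.5));
    }
}
```

Because String.hashCode() is far from uniform on typical inputs, real collisions tend to show up with fewer strings than this estimate suggests.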

Perhaps you are thinking of a cryptographic hash function, where

a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value

If so, note that Object.hashCode is not designed for cryptographic purposes.

See also: How secure is Java's hashCode()?

Their hash codes are

AaAa: ((65 * 31 + 97) * 31 + 65) * 31 + 97 = 2031744
AaBB: ((65 * 31 + 97) * 31 + 66) * 31 + 66 = 2031744
BBBB: ((66 * 31 + 66) * 31 + 66) * 31 + 66 = 2031744

That is just how the math works out; there is nothing to be confused about.
Note that 'a' (97) and 'B' (66) differ by exactly 31, while 'A' (65) and 'B' (66) differ by 1. Raising the first character of a pair by 1 adds 31 to the hash, and lowering the second character by 31 takes it away again, so "Aa" and "BB" both hash to 2112. That is what makes these hash codes line up so nicely.
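Since "Aa" and "BB" have the same length and the same hash, and the hash of a concatenation xy equals hash(x) * 31^len(y) + hash(y), any sequence built from these two blocks collides with every other sequence of the same number of blocks. The following sketch (the class and method names are made up for illustration) enumerates them:

```java
import java.util.ArrayList;
import java.util.List;

class CollisionBlocks {
    // Builds all 2^blocks strings made of the two-character blocks
    // "Aa" and "BB"; every string in the result has the same hashCode().
    static List<String> collisions(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : collisions(2)) {
            System.out.println(s + " -> " + s.hashCode());
        }
    }
}
```

With two blocks this yields the three strings from the question plus "BBAa", all with the same hash code.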

Here is the description of the Object#hashCode method from the Java documentation:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

It is not required that if two objects are unequal according to the java.lang.Object#equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

The String class follows this contract as well: unequal strings are allowed to share a hash code, so what you observed is normal behaviour.
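As a small illustration of why this is harmless in practice, the sketch below puts the three colliding strings into a HashMap. The class name is hypothetical; the point is only that the map falls back on equals to tell keys with identical hash codes apart.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeys {
    // Stores three keys that share one hash code; they land in the same
    // bucket, but equals() still distinguishes them.
    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("AaAa", 1);
        map.put("AaBB", 2);
        map.put("BBBB", 3);
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = build();
        System.out.println(map.size() + " entries, AaBB -> " + map.get("AaBB"));
    }
}
```

The only cost of the collision is that lookups for these keys have to compare more than one candidate entry.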

There are several types of hash functions with different design and performance criteria.

  1. Hash functions used for indexing, such as in associative arrays and similar structures, can have frequent collisions without any problem, because the hash table code handles them in some manner, such as putting the colliding entries in lists or re-hashing. Here it is all about speed. Java's hashCode() seems to be of this type.

  2. Another type of function, a cryptographic hash such as SHA*, strives to avoid collisions at the expense of hashing performance.

  3. Yet a third type of hash function is a password-verifier hash, which is designed to be very slow (~100 ms is common) and may require large amounts of memory; not-too-frequent collisions are not a concern. The point here is to make brute-force attacks take so long as to be infeasible.

One chooses the type and characteristics of a hash based on usage.
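To see the contrast between the first two types, the three strings from the question can be run through SHA-256 using the standard java.security.MessageDigest API. This is only a sketch; the class name and the sha256Hex helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestCompare {
    // Hypothetical helper: returns the SHA-256 digest of s as lowercase hex.
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AaAa", "AaBB", "BBBB"}) {
            System.out.println(s + ": hashCode=" + s.hashCode()
                    + ", sha256=" + sha256Hex(s));
        }
    }
}
```

The hashCode column is identical for all three inputs, while the SHA-256 digests differ, at the price of considerably more work per string.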
