
Why do the following three strings have the same hash code?

Even after reading the JDK source code, I am still surprised that the strings "AaAa", "AaBB" and "BBBB" all have the same hash code.

The relevant JDK source is as follows:

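int h = hash;                        // cached value; 0 means not computed yet
if (h == 0 && value.length > 0) {
    char val[] = value;

    // Horner's rule for s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    for (int i = 0; i < value.length; i++) {
        h = 31 * h + val[i];
    }
    hash = h;                        // cache the result for later calls
}
return h;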

Could anyone clarify this?

Because that's how the hash code of a String is defined to be calculated:

The hash code for a String object is computed as

 s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] 

So:

  • For "AaAa": 65*31^3 + 97*31^2 + 65*31 + 97 = 2031744
  • For "AaBB": 65*31^3 + 97*31^2 + 66*31 + 66 = 2031744
  • For "BBBB": 66*31^3 + 66*31^2 + 66*31 + 66 = 2031744
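
The arithmetic above can be checked with a short program. This is a minimal sketch: the class name HashDemo and the helper polyHash are made up for illustration, and polyHash simply re-implements the documented formula so it can be compared with the built-in String.hashCode().

```java
class HashDemo {
    // Hypothetical helper: evaluates s[0]*31^(n-1) + ... + s[n-1]
    // with Horner's rule, using int arithmetic like String.hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AaAa", "AaBB", "BBBB"}) {
            // Both columns show 2031744 for all three strings.
            System.out.println(s + " -> " + s.hashCode() + " (manual: " + polyHash(s) + ")");
        }
    }
}
```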

Because of probability.

There are about 4.3 billion possible hash codes (2^32, from Integer.MIN_VALUE to Integer.MAX_VALUE) and a practically unlimited number of possible Strings, so collisions are bound to occur. In fact, the birthday problem shows that only about 77,000 strings are needed for a 50% chance that some pair collides, and that figure assumes the hash function spreads its outputs uniformly, which this one does not.
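
The birthday estimate can be reproduced with the usual approximation n ~= sqrt(2 * N * ln(1 / (1 - p))), where N is the number of possible hash values and p the desired collision probability. The sketch below assumes an ideal, uniformly distributed hash; the class and method names are illustrative only.

```java
class BirthdayBound {
    // Approximate number of uniformly random draws from `buckets` possible
    // values needed to reach collision probability p (birthday approximation).
    static double drawsForCollision(double buckets, double p) {
        return Math.sqrt(2.0 * buckets * Math.log(1.0 / (1.0 - p)));
    }

    public static void main(String[] args) {
        double buckets = Math.pow(2, 32); // number of distinct int values
        System.out.printf("50%% chance of a collision after about %.0f strings%n",
                drawsForCollision(buckets, 0.5));
    }
}
```

For 2^32 buckets and p = 0.5 the result lands a little above 77,000, which matches the figure quoted in this answer.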

Perhaps you are thinking of a cryptographic hash function, where

a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value

If so, note that Object.hashCode is not designed for cryptographic purposes.

See also How secure is Java's hashCode()?

Their hash codes are:

AaAa: ((65 * 31 + 97) * 31 + 65) * 31 + 97 = 2031744
AaBB: ((65 * 31 + 97) * 31 + 66) * 31 + 66 = 2031744
BBBB: ((66 * 31 + 66) * 31 + 66) * 31 + 66 = 2031744

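That is simply how the arithmetic works out; there is nothing mysterious about it.
Note that 'a' (97) and 'B' (66) differ by exactly 31, while 'A' (65) and 'B' (66) differ by 1. Replacing "Aa" with "BB" therefore adds 1 * 31 at one position and subtracts 31 at the next, so the two changes cancel: 65 * 31 + 97 = 66 * 31 + 66 = 2112. That is what makes these hash codes line up so nicely.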
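
Because "Aa" and "BB" have the same hash code and the same length, any sequence of such two-character blocks collides as well. The following sketch (class and method names are hypothetical) enumerates all 2^n combinations for n blocks:

```java
import java.util.ArrayList;
import java.util.List;

class CollisionBuilder {
    // Builds every string made of `blocks` two-character blocks,
    // each block being either "Aa" or "BB". All results share one hash code.
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // With 2 blocks: AaAa, AaBB, BBAa, BBBB, each hashing to 2031744.
        for (String s : collidingStrings(2)) {
            System.out.println(s + " -> " + s.hashCode());
        }
    }
}
```

With 2 blocks this yields the three strings from the question plus "BBAa", which has the same hash code too.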

Here is the description of the Object#hashCode method from the Java documentation:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

It is not required that if two objects are unequal according to the java.lang.Object#equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

So the implementation in the String class also follows this contract: equal strings must have equal hash codes, but unequal strings are allowed to share one. This is normal behavior.
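
In practice a collision does not break hash-based collections, because they fall back to equals() to tell keys apart. A minimal sketch using java.util.HashMap (the class name CollidingKeys and the stored values are arbitrary examples):

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeys {
    // Stores the three colliding strings as separate keys.
    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("AaAa", 1);
        map.put("AaBB", 2);
        map.put("BBBB", 3);
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = build();
        // Same hash code, but equals() distinguishes the keys.
        System.out.println(map.size());      // 3
        System.out.println(map.get("AaBB")); // 2
    }
}
```

The three keys end up in the same bucket, which costs a little lookup time but does not affect correctness.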

There are several types of hash functions with different design and performance criteria.

  1. Hash functions used for indexing, such as in associative arrays and similar structures, can have frequent collisions without any problem, because the hash table code handles them in some manner, such as placing colliding entries in lists or re-hashing. Here it is all about speed. Java's hashCode() appears to be of this type.

  2. Another type of function, a cryptographic hash such as SHA*, strives to avoid collisions at the expense of hashing performance.

  3. A third type of hash function is a password verifier hash, which is designed to be very slow (~100 ms is common) and may require large amounts of memory; not-too-frequent collisions are not a concern. The point here is to make brute-force attacks take so long as to be infeasible.

One chooses the type and characteristics of a hash based on its usage.
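
To illustrate the difference between the first two types, the sketch below hashes the three strings from the question with SHA-256 through the standard java.security.MessageDigest API. The class name and the sha256Hex helper are made up for this example; it only shows that the digests differ even though the hashCode() values are identical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestDemo {
    // Hypothetical helper: returns the SHA-256 digest of s as lowercase hex.
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AaAa", "AaBB", "BBBB"}) {
            System.out.println(s + " hashCode=" + s.hashCode() + " sha256=" + sha256Hex(s));
        }
    }
}
```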
