简体   繁体   中英

Consistency of hashCode() on a Java string

The hashCode value of a Java String is computed as ( String.hashCode() ):

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Are there any circumstances (say JVM version, vendor, etc.) under which the following expression will evaluate to false?

boolean expression = "This is a Java string".hashCode() == 586653468

If you claim that the answer is "yes, there are such circumstances" - then please give a concrete example of when "This is a Java string".hashCode() != 586653468. Try to be as specific/concrete as possible. 如果您声称答案是“是,有这种情况” - 那么请举一个具体的例子,说明“这是一个Java字符串”.hashCode()!= 586653468。尝试具体/具体尽可能。

We all know that relying on the implementation details of hashCode() is bad in general. 我们都知道,依赖hashCode()的实现细节通常很糟糕。 However, I'm talking specifically about String.hashCode() - so please keep the answer focused to String.hashCode(). Object.hashCode() is totally irrelevant in the context of this question.

I can see that documentation as far back as Java 1.2.

While it's true that in general you shouldn't rely on a hash code implementation remaining the same, it's now documented behaviour for java.lang.String , so changing it would count as breaking existing contracts.

Wherever possible, you shouldn't rely on hash codes staying the same across versions etc - but in my mind java.lang.String is a special case simply because the algorithm has been specified... so long as you're willing to abandon compatibility with releases before the algorithm was specified, of course.

I found something about JDK 1.0 and 1.1 and >= 1.2:

In JDK 1.0.x and 1.1.x the hashCode function for long Strings worked by sampling every nth character. This pretty well guaranteed you would have many Strings hashing to the same value, thus slowing down Hashtable lookup. In JDK 1.2 the function has been improved to multiply the result so far by 31 then add the next character in sequence. This is a little slower, but is much better at avoiding collisions. Source: http://mindprod.com/jgloss/hashcode.html

Something different, because you seem to need a number: How about using CRC32 or MD5 instead of hashcode and you are good to go - no discussions and no worries at all...

You should not rely on a hash code being equal to a specific value. Just that it will return consistent results within the same execution. The API docs say the following :

The general contract of hashCode is:

  • Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

EDIT Since the javadoc for String.hashCode() specifies how a String's hash code is computed, any violation of this would violate the public API specification.

As said above, in general you should not rely on the hash code of a class remaining the same. Note that even subsequent runs of the same application on the same VM may produce different hash values. AFAIK the Sun JVM's hash function calculates the same hash on every run, but that's not guaranteed.

Note that this is not theoretical. The hash function for java.lang.String was changed in JDK1.2 (the old hash had problems with hierarchical strings like URLs or file names, as it tended to produce the same hash for strings which only differed at the end).

java.lang.String is a special case, as the algorithm of its hashCode() is (now) documented, so you can probably rely on that. I'd still consider it bad practice. If you need a hash algorithm with special, documented properties, just write one :-).

Another (!) issue to worry about is the possible change of implementation between early/late versions of Java. I don't believe the implementation details are set in stone, and so potentially an upgrade to a future Java version could cause problems.

Bottom line is, I wouldn't rely on the implementation of hashCode() .

Perhaps you can highlight what problem you're actually trying to solve by using this mechanism, and that will highlight a more suitable approach.

Just to answer your question and not to continue any discussions. The Apache Harmony JDK implementation seems to use a different algorithm, at least it looks totally different:

Sun JDK

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

Apache Harmony

public int hashCode() {
    if (hashCode == 0) {
        int hash = 0, multiplier = 1;
        for (int i = offset + count - 1; i >= offset; i--) {
            hash += value[i] * multiplier;
            int shifted = multiplier << 5;
            multiplier = shifted - multiplier;
        }
        hashCode = hash;
    }
    return hashCode;
}

Feel free to check it yourself...

如果您担心更改以及可能不兼容的VM,只需将现有的哈希码实现复制到您自己的实用程序类中,然后使用它来生成哈希码。

The hashcode will be calculated based on the ASCII values of the characters in the String.

This is the implementation in the String Class is as follows

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}

Collisions in hashcode are unavoidable. For example, the strings "Ea" and "FB" give the same hashcode as 2236

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM