简体   繁体   中英

Can I be sure the built-in hash for a given string is always the same?

I am getting a string hash like this:

string content = "a very long string";
int contentHash = content.GetHashCode();

I am then storing the hash into a dictionary as key mapping to another ID. This is useful so I don't have to compare big strings during default dictionary hash computation but I can just fish the ID from the dictionary by key.

Can I be sure that the hash for a given string ("a very long string") will be always the same?

Can I be sure that two different strings won't have the same hash?

Also, if possible, how likely is it to get the same hash for different strings?

Yes, it will be consistent since strings are immutable. However, I think you're misusing the dictionary. You should let the dictionary take the hash of the string for you by using the string as the key. Hashes are not guaranteed to be unique, so you may overwrite one key with another.

Just to add some detail as to where the idea of a changing hashcode may have come from.

As the other answers have rightly said the hashcode for a specific string will always be the same for a specific runtime version. There is no guarantee that a newer runtime might use a different algorithm perhaps for performance reasons.

The String class overrides the default GetHashCode implementation in object.

The default implementation for a reference type in .NET is to allocate a sequential ID (held internally by .NET) and assign it to the object (the objects heap storage has slot for storing this hashcode, it only assigned on the first call to GetHashCode for that object).

Hence creating an instance of a class, assigning it some values then retrieving the hashcode, followed by doing the exact same sequence with the same set of values will yeild different hashcodes. This may be the reason why some have been led to believe that hashcodes can change. In fact though its the instance of a class which is allocated a hashcode once allocated that hashcode does not change for that instance.

Edit : I've just noticed that none of the answers directly reference each of you questions (although I think the answer to them is clear) but just to tidy up:-

Can I be sure that the hash for a given string ("a very long string") will be always the same?

In your usage, yes.

Can I be sure that two different strings won't have the same hash?

No. Two different strings may have the same hash.

Also, if possible, how likely is it to get the same hash for different strings?

The probability is quite low, resulting hash is pretty random from a 4G domain.

Yes it will, that's the purpose of a hash code! It's not guaranteed to be the same between different versions of the runtime tho. More info on MSDN

As others pointed out, the hash will remain constant over time. But why are you hashing a string and then put it as key on a Dictionary? Hashes are not guaranteed to be unique. So you comparisons might be incorrect. Let the Dictionary do it's job. I think the most appropriate collection to this case is a HashSet .

As many others have said, the implementation is dependent on the version of the framework but it also depends on the architecture . The implementation of string.GetHashCode() is dfferent in the x86 and x64 versions of the framework even if they have the same version number.

For example, if you are writing a client/server or .net remoting type of architecture and want to use a string HashCode to stop from downloading a large resource, you can only do this if both are the same version and bitness. Otherwise you should use a different hash -- MD5, SHA etc will work correctly.

The documentation for Object.GetHashCode states

If two objects compare as equal, the GetHashCode method for each object must return the same value.

So you are guaranteed that the hash code will be the same for a given string. However, you aren't guaranteed that it will be unique (there may be other strings that have the same hash code).

You don't have to guess about run-times or versions, just use this CaseInsensitiveStringComparer class that I made in my spare time (you can pass it to the constructor of the dictionary or if you are using .NET 3.5, a HashSet):

/// <summary>
/// StringComparer that is basically the same as StringComparer.OrdinalIgnoreCase, except that the hash code function is improved and guaranteed not to change.
/// </summary>
public class CaseInsensitiveStringComparer : StringComparer
{
    /// <summary>
    /// Compares two strings, ignoring case
    /// </summary>
    /// <param name="x">First string</param>
    /// <param name="y">Second string</param>
    /// <returns>Compare result</returns>
    public override int Compare(string x, string y)
    {
        return StringComparer.OrdinalIgnoreCase.Compare(x, y);
    }

    /// <summary>
    /// Checks if two strings are equal, ignoring case
    /// </summary>
    /// <param name="x">First string</param>
    /// <param name="y">Second string</param>
    /// <returns>True if strings are equal, false if not</returns>
    public override bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    /// <summary>
    /// Gets a hash code for a string, ignoring case
    /// </summary>
    /// <param name="obj">String to get hash code for</param>
    /// <returns>Hash code</returns>
    public override int GetHashCode(string obj)
    {
        if (obj == null)
        {
            return 0;
        }
        int hashCode = 5381;
        char c;
        for (int i = 0; i < obj.Length; i++)
        {
            c = obj[i];
            if (char.IsLower(c))
            {
                c = char.ToUpperInvariant(c);
            }
            hashCode = ((hashCode << 5) + hashCode) + c;
        }
        return hashCode;
    }
}

字符串根据其内容进行哈希处理,因此,如果使用默认的GetHashCode,则哈希应该保持不变。

As has already been mentioned you can be sure that a hash for a partiular string will be the same as they are hashed based on content. However you cannot be sure that a particular string will be hashed the same for later versions of the .NET framework as is mentioned here

So I would say that this method is fine if it is being used internally to an application. If you are persisting the value to a data store then it is probably best to roll your own function to ensure that it remains consistent across versions.

Can I be sure that the hash for a given string ("a very long string") will be always the same?

Yes

Can I be sure that two different strings won't have the same hash?

No

Given that there are an infinite number of different strings its just not possible to allocate a different int (32bits which can represent up to 4 billion) number for each.

With just 8 characters tehre are 2^60 different strings. This is infinitely larger than 2^32. Naturally the hashcode of some of these strings must clash.

Two objects with the same hashcode do not have to be equal. To know for sure use the equals method. This is basically the strategy used by a hashmap to determine if keys are equal.

Map.get(String key)

  • Calculate hashcode of key
  • Use modulo to figure out which bucket key belongs too.
  • Loop thru all the entries from that bucket attempting to find a matching key.
  • When a key match is found return that entries' value.

As a side note as maps gain more and more elements it will recreate more buckets and place all the old entries into the new buckets. This helps present the bucket entry list from growing into really really long lists. A map wants many buckets with short lists.

The javadoc for Object.hashcode makes for interesting reading - ive pasted a snippet below.

 The equals method implements an equivalence relation:

* It is reflexive: for any reference value x, x.equals(x) should return true.
* It is symmetric: for any reference values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
* It is transitive: for any reference values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
* It is consistent: for any reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the object is modified.
* For any non-null reference value x, x.equals(null) should return false. 

The equals method for class Object implements the most discriminating possible equivalence relation on objects; that is, for any reference values x and y, this method returns true if and only if x and y refer to the same object (x==y has the value true).

This is a great example for the evils of premature optimization.

Do you have the output of a profiler or benchmark that tells you String comparison between entries in the same hash bucket is actually causing a performance problem?

Didn't think so. Just use the string itself as a key in the Dictionary. That's how you're supposed to use it.

BTW, there are far, far more different strings than there are different int, so basic logic tells you that it's impossible to have a different hashcode for each different string.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM