简体   繁体   中英

string.GetHashCode() uniqueness and collisions

Given two different strings, is it always the case that s.GetHashCode() != s1.GetHashCode() ?

Is it the case that the number of distinct integers is less than the number of distinct strings?

No. Just as a simple thought experiment: How many strings are there (hint: many more than 2 32 and thus how many unique hash codes can there be (hint: 2 32 . See the problem? )

Hash codes are just required to be equal whenever Equals returns that both objects are equal. Furthermore, whenever two hash codes are not equal, then the objects themselves cannot be equal. There is no further requirement, but they should be well-distributed so that hash tables can perform well. So basically it's:

在此输入图像描述

Note the omission of the respective ⇐ variants. It's not an equivalence, just two implications.

To quote the documentation :

A hash function must have the following properties:

  1. If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

  2. The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

  3. For the best performance, a hash function must generate a random distribution for all input.

To add to @Joey's statement you principally cannot have the hashcodes always be unequal.

There are 2^32 possible hash codes but infinite input strings.

Hash collisions are guaranteed to happen with enough (2^32 + 1) input values.

In fact, hash collisions are much more common than one might think due to the Birthday Problem . When I did the math a while back for a system that used 64 bit hash codes (which have way more possible hash values than 32-bit hash codes, not just double as one might naively think), with 100 million input values it was very possible that there would be at least 1 hash collision. I think the probability was around 1%.

As far i know Object.GetHashCode() does not provide an hash function over the object (so i suppose Joey's consideration is not correct in this case), it only return an unique index assigned to the object by CLR when the object is created and released when the object is garbage collected.

So you can not have an hashcode duplicate (in the same AppDomain) in a given moment, but you could have a douplicate over the time (the same index may be assigned more than once during application execution).

The question is also discussed here : Default implementation for Object.GetHashCode()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM