简体   繁体   中英

What hash algorithm does .net utilise? What about java?

Regarding the HashTable (and subsequent derivatives of such) does anyone know what hashing algorithm .net and Java utilise?

Are List and Dictionary both direct descandents of Hashtable?

The hash function is not built into the hash table; the hash table invokes a method on the key object to compute the hash. So, the hash function varies depending on the type of key object.

In Java, a List is not a hash table (that is, it doesn't extend the Map interface). One could implement a List with a hash table internally (a sparse list, where the list index is the key into the hash table), but such an implementation is not part of the standard Java library.

I know nothing about .NET but I'll attempt to speak for Java.

In Java, the hash code is ultimately a combination of the code returned by a given object's hashCode() function, and a secondary hash function inside the HashMap/ConcurrentHashMap class (interestingly, the two use different functions). Note that Hashtable and Dictionary (the precursors to HashMap and AbstractMap) are obsolete classes. And a list is really just "something else".

As an example, the String class constructs a hash code by repeatedly multiplying the current code by 31 and adding in the next character. See my article on how the String hash function works for more information. Numbers generally use "themselves" as the hash code; other classes, eg Rectangle, that have a combination of fields often use a combination of the String technique of multiplying by a small prime number and adding in, but add in the various field values. (Choosing a prime number means you're unlikely to get "accidental interactions" between certain values and the hash code width, since they don't divide by anything.)

Since the hash table size-- ie the number of "buckets" it has-- is a power of two, a bucket number is derived from the hash code essentially by lopping off the top bits until the hash code is in range. The secondary hash function protects against hash functions where all or most of the randomness is in those top bits, by "spreading the bits around" so that some of the randomness ends up in the bottom bits and doesn't get lopped off. The String hash code would actually work fairly well without this mixing, but user-created hash codes may not work quite so well. Note that if two different hash codes resolve to the same bucket number, Java's HashMap implementations use the "chaining" technique-- ie they create a linked list of entries in each bucket. It's thus important for hash codes to have a good degree of randomness so that items don't cluster into a particular range of buckets. (However, even with a perfect hash function, you will still by law of averages expect some chaining to occur.)

Hash code implementations shouldn't be a mystery. You can look at the hashCode() source for any class you choose.

While looking for the same answer myself, I found this in .net's reference source @ http://referencesource.microsoft.com .

     /*
      Implementation Notes:
      The generic Dictionary was copied from Hashtable's source - any bug 
      fixes here probably need to be made to the generic Dictionary as well.
      This Hashtable uses double hashing.  There are hashsize buckets in the 
      table, and each bucket can contain 0 or 1 element.  We a bit to mark
      whether there's been a collision when we inserted multiple elements
      (ie, an inserted item was hashed at least a second time and we probed 
      this bucket, but it was already in use).  Using the collision bit, we
      can terminate lookups & removes for elements that aren't in the hash
      table more quickly.  We steal the most significant bit from the hash code
      to store the collision bit.

      Our hash function is of the following form:

      h(key, n) = h1(key) + n*h2(key)

      where n is the number of times we've hit a collided bucket and rehashed
      (on this particular lookup).  Here are our hash functions:

      h1(key) = GetHash(key);  // default implementation calls key.GetHashCode();
      h2(key) = 1 + (((h1(key) >> 5) + 1) % (hashsize - 1));

      The h1 can return any number.  h2 must return a number between 1 and
      hashsize - 1 that is relatively prime to hashsize (not a problem if 
      hashsize is prime).  (Knuth's Art of Computer Programming, Vol. 3, p. 528-9)
      If this is true, then we are guaranteed to visit every bucket in exactly
      hashsize probes, since the least common multiple of hashsize and h2(key)
      will be hashsize * h2(key).  (This is the first number where adding h2 to
      h1 mod hashsize will be 0 and we will search the same bucket twice).

      We previously used a different h2(key, n) that was not constant.  That is a 
      horrifically bad idea, unless you can prove that series will never produce
      any identical numbers that overlap when you mod them by hashsize, for all
      subranges from i to i+hashsize, for all i.  It's not worth investigating,
      since there was no clear benefit from using that hash function, and it was
      broken.

      For efficiency reasons, we've implemented this by storing h1 and h2 in a 
      temporary, and setting a variable called seed equal to h1.  We do a probe,
      and if we collided, we simply add h2 to seed each time through the loop.

      A good test for h2() is to subclass Hashtable, provide your own implementation
      of GetHash() that returns a constant, then add many items to the hash table.
      Make sure Count equals the number of items you inserted.

      Note that when we remove an item from the hash table, we set the key
      equal to buckets, if there was a collision in this bucket.  Otherwise
      we'd either wipe out the collision bit, or we'd still have an item in
      the hash table.

       -- 
    */

The HASHING algorithm is the algorithm used to determine the hash code of an item within the HashTable.

The HASHTABLE algorithm (which I think is what this person is asking) is the algorithm the HashTable uses to organize its elements given their hash code.

Java happens to use a chained hash table algorithm.

Anything purporting to be a HashTable or something like it in .NET does not implement its own hashing algorithm: they always call the object-being-hashed's GetHashCode() method.

There is a lot of confusion though as to what this method does or is supposed to do, especially when concerning user-defined or otherwise custom classes that override the base Object implementation .

For .NET, you can use Reflector to see the various algorithms. There is a different one for the generic and non-generic hash table, plus of course each class defines its own hash code formula.

The .NET Dictionary<T> class uses an IEqualityComparer<T> to compute hash codes for keys and to perform comparisons between keys in order to do hash lookups. If you don't provide an IEqualityComparer<T> when constructing the Dictionary<T> instance (it's an optional argument to the constructor) it will create a default one for you, which uses the object.GetHashCode and object.Equals methods by default.

As for how the standard GetHashCode implementation works, I'm not sure it's documented. For specific types you can read the source code for the method in Reflector or try checking the Rotor source code to see if it's there.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM