
Better 64-bit byte array hash

I need a hash algorithm that produces a 64-bit hash code ( long ) with fewer collisions than String.GetHashCode() and that is fast (no expensive calls to cryptographic functions). Here's an implementation of FNV that still shows about 3% collisions when tested on 2 million random strings. I need this number to be much lower.

void Main()
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\";
    const int n = 2000000;
    var random = new Random();
    var hashes = new HashSet<long>();
    int collisions = 0;
    for(int i = 0; i < n; i++)
    {
        var len = random.Next(chars.Length);
        var str = new char[len];
        for (int j = 0; j < len; j++)
        {
            str[j] = chars[random.Next(chars.Length)];
        }
        var s = new String(str);
        if(!hashes.Add(Get64BitHash( s ))) collisions++;
    }
    Console.WriteLine("Collision Percentage after " + n + " random strings: " + ((double)collisions * 100 / n));
}


public long Get64BitHash(string str)
{
  unchecked
  {
     byte[] data = new byte[str.Length * sizeof(char)];
     System.Buffer.BlockCopy(str.ToCharArray(), 0, data, 0, data.Length);

     const ulong p = 1099511628211UL;
     var hash = 14695981039346656037UL;
     foreach(var d in data)
     {
        hash ^= d;
        hash *= p;
     }
     return (long) hash;
  }
}

SAMPLE OUTPUT OF ABOVE CODE:

Collision Percentage after 2000000 random strings: 3.01485 %

3% is the same collision percentage as just calling String.GetHashCode() . I need something way better.

PS: There's a chance I am doing something terribly wrong.

EDIT : Solved . The Get64BitHash method above is correct. The problem was that my random strings weren't unique: the generator produced many duplicate strings, and each duplicate was counted as a hash collision. After making sure the strings are unique (see revised code below), I get zero collisions on almost 50 million unique strings, versus ~1% collisions using String.GetHashCode() .

void Main()
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#@$%^&*()_+}{\":?><,./;'[]0123456789\\";
    const int n = 200000000;
    var random = new Random();
    var hashes = new HashSet<long>();
    var strings = new HashSet<string>();
    int collisions = 0;
    while(strings.Count < n)
    {
        var len = random.Next(chars.Length);
        var str = new char[len];
        for (int j = 0; j < len; j++)
        {
            str[j] = chars[random.Next(chars.Length)];
        }
        var s = new String(str);
        if(!strings.Add(s)) continue;
        if(!hashes.Add(s.GetHashCode())) collisions++;
    }
    Console.WriteLine("Collision Percentage after " + n + " random strings: " + ((double)collisions * 100 / strings.Count));
}
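As a rough sanity check on the figures in the edit above, here is a minimal sketch of a birthday-style estimate. It assumes an ideal, uniformly distributed hash, which neither FNV nor String.GetHashCode() is guaranteed to be, and the class and method names are made up for illustration. It uses the first-order approximation n(n-1)/2 divided by 2^bits, which only holds while n is much smaller than 2^bits.

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate expected number of colliding pairs when n distinct keys are
    // hashed uniformly into 2^bits values: n * (n - 1) / 2 / 2^bits.
    // Assumption: an ideal uniform hash and n much smaller than 2^bits.
    public static double ExpectedCollisions(double n, int bits)
    {
        double buckets = Math.Pow(2.0, bits);
        return n * (n - 1.0) / 2.0 / buckets;
    }

    // The same estimate expressed as a percentage of the n keys.
    public static double ExpectedCollisionPercentage(double n, int bits)
    {
        return ExpectedCollisions(n, bits) * 100.0 / n;
    }
}
```

Calling BirthdayEstimate.ExpectedCollisionPercentage(50000000, 32) gives roughly 0.6%, the same order of magnitude as the ~1% reported for String.GetHashCode(), while the 64-bit estimate for the same n is far below one expected collision in total, which fits the zero collisions observed.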

The problem is that your strings aren't unique. Check whether a string has already been generated before hashing it a second time.

"3% is the same collision percentage as just calling String.GetHashCode()"

Maybe that is the theoretical optimum. The built-in hash code is not bad. Try it with SHA2 to confirm that this is the best you can do.
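One way to run the SHA-2 comparison suggested above is to truncate a SHA-256 digest to 64 bits and plug it into the same test loop. This is only a sketch of a baseline, not something meant for production: the class and method names are made up, the choice of the first 8 digest bytes is arbitrary, and Encoding.Unicode (UTF-16LE) is assumed so that the input bytes match what the question's BlockCopy produces on a little-endian machine.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Sha2Baseline
{
    // Hashes the UTF-16LE bytes of the string with SHA-256 and keeps the
    // first 8 bytes of the digest as a 64-bit value (an arbitrary choice).
    public static long Get64BitHashSha256(string str)
    {
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.Unicode.GetBytes(str));
            return BitConverter.ToInt64(digest, 0);
        }
    }
}
```

If swapping Get64BitHash for Sha2Baseline.Get64BitHashSha256 in the test leaves the collision percentage unchanged, the collisions are coming from the input data rather than from the hash function.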

Since your test strings are random, the hash codes are probably well distributed too.

Optimize the function by not creating two temporary buffers that do not seem to serve any purpose. Just access the chars directly ( str[0] ). That way you save the copy and process two bytes per iteration.
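A minimal sketch of that optimization, keeping the constants from the question: it reads each char directly and feeds its low byte and then its high byte into FNV-1a, so on a little-endian machine it should produce the same values as the BlockCopy version without allocating any buffers. The class name is made up for illustration. XORing the whole 16-bit char in a single step would also work, but that is a different variant and changes the resulting hash values.

```csharp
using System;

public static class Fnv1a
{
    private const ulong OffsetBasis = 14695981039346656037UL;
    private const ulong Prime = 1099511628211UL;

    // FNV-1a over the UTF-16 code units of the string, low byte first,
    // without copying the string into temporary char[] or byte[] buffers.
    public static long Get64BitHash(string str)
    {
        unchecked
        {
            ulong hash = OffsetBasis;
            foreach (char c in str)
            {
                hash ^= (byte)c;        // low byte of the char
                hash *= Prime;
                hash ^= (byte)(c >> 8); // high byte of the char
                hash *= Prime;
            }
            return (long)hash;
        }
    }
}
```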

You should count only the real hash collisions, because most of the collisions you are reporting come from duplicate strings, not from the hash function.

Declare the following:

var hashesString = new HashSet<string>();
int collisionsString = 0;
int testedCollisions = 0;

Then modify your code as follows:

        if (hashesString.Add(s))
        {
            // Count collisions only for strings that have not been seen before
            testedCollisions++;
            if (!hashes.Add(Get64BitHash(s))) collisions++;
        }
    } // end of the generation loop
    Console.WriteLine("Collision Percentage after " + testedCollisions + " random strings: " + ((double)collisions * 100 / testedCollisions));

I did a run with the updated code and got no real hash collisions, just about 60,000 duplicate strings.
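The duplicate count is about what the question's generator should produce. random.Next(chars.Length) picks every length from 0 to chars.Length - 1 equally often, so a sizeable share of the strings are empty or only one or two characters long, and those can take very few distinct values. The sketch below estimates the expected number of repeats; the class and method names are made up for illustration, and it assumes lengths and characters are drawn uniformly.

```csharp
using System;

public static class DuplicateEstimate
{
    // Expected number of repeated draws when k strings are drawn uniformly
    // from `pool` possible strings: k - pool * (1 - (1 - 1/pool)^k).
    public static double ExpectedRepeats(double k, double pool)
    {
        // For huge pools 1 - 1/pool rounds to 1.0 in double precision, so
        // fall back to the pairwise approximation k * (k - 1) / (2 * pool).
        if (pool > 1e6 * k) return k * (k - 1.0) / (2.0 * pool);
        double distinct = pool * (1.0 - Math.Pow(1.0 - 1.0 / pool, k));
        return k - distinct;
    }

    // n strings in total, each length 0..alphabetSize-1 equally likely,
    // each character drawn uniformly from alphabetSize symbols.
    public static double ExpectedDuplicateStrings(int n, int alphabetSize)
    {
        double perLength = (double)n / alphabetSize;
        double total = 0;
        for (int len = 0; len < alphabetSize; len++)
        {
            double pool = Math.Pow(alphabetSize, len); // number of strings of this length
            total += ExpectedRepeats(perLength, pool);
        }
        return total;
    }
}
```

With n = 2000000 and the alphabet size taken from chars.Length, DuplicateEstimate.ExpectedDuplicateStrings should land near 60,000, which is about 3% of the strings and in line with the collision percentage the original test reported.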
