High-performance hashing without collisions?

The hashing function below was borrowed heavily from this post, but it produces too many collisions in my application.

public static class Hashing
{
  private const int FNV1a_offsetBias = unchecked( ( int )0x81_1c_9d_c5 );
  private const int FNV1a_prime = 16_777_619;

  public static int FNV1a(params dynamic[] values) {
     var hash = FNV1a_offsetBias;

     foreach ( var value in values )
        hash = FNV1a_Crank(hash, value.GetHashCode());

     return hash;
  }

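  // Note: this multiplies and then adds. Textbook FNV-1a instead XORs the
  // input into the hash first and then multiplies by the prime.
  private static int FNV1a_Crank(int start, int addendum) {
     unchecked {
        start *= FNV1a_prime;
        start += addendum;
     }

     return start;
  }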
}

I need high-performance hashing that is guaranteed unique. I realize it will likely need to be slower than the function above, but I'm hoping to find something that is not dramatically slower. The SE post linked above is fascinating and useful, but also leaves me confused and wondering what to use.

The use case for my hashing is this: I have an app that inserts millions of records into my database every day. The target tables have unique keys, so any insert that violates uniqueness throws an exception. I cannot allow these exceptions to be thrown because handling them is far too slow, and they are better avoided for other reasons as well. So I use the function above to hash the column values in the composite unique key of each insert and store the result in a hash table. Before each insert, I generate a hash and look it up in the hash table. If it's not there, then I'm safe to do the insert. If it is there, the record already exists, and I skip the insert.

It's very fast, and I thought it worked at first. But then I found dozens of cases (out of millions) in which the hashes of two different keys collide, so my app believes a record has already been inserted when in fact it has not. The result is missing records, which is unacceptable to the business.
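For a sense of scale, here is a rough birthday-bound estimate of how many collisions to expect. It is only a sketch: it assumes an ideal hash that spreads keys uniformly over its output range, which the function above may not do, and the class and method names are made up for illustration.

```csharp
using System;

public static class CollisionEstimate
{
    // Expected number of colliding pairs when n distinct keys are hashed
    // uniformly into 2^bits possible values: there are n(n-1)/2 pairs,
    // and each pair collides with probability 1 / 2^bits.
    public static double ExpectedCollidingPairs(long n, int bits)
    {
        double pairs = n * (n - 1.0) / 2.0;
        return pairs / Math.Pow(2.0, bits);
    }
}
```

Under that assumption, `CollisionEstimate.ExpectedCollidingPairs(1_000_000, 32)` comes out at a bit over a hundred, so dozens of collisions among millions of keys is about what any 32-bit hash would give. A wider hash lowers the number but never makes it exactly zero.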

Here are a few examples of the sort of data I am hashing:

Hashing.FNV1a("Z125",  "99-8ZG10", "SpecialZ_S07181_2");
Hashing.FNV1a("G125");
Hashing.FNV1a("G99-76", "F78_XYZ_92323");

So I'm looking for a C# function that provides the fastest possible hashing algorithm that is guaranteed unique. In other words, I need a performant way to check, millions of times, whether a record already exists in the table. Hashing seems like the fastest way, but uniqueness is paramount.

Any ideas?

It appears your goal is to generate a unique identifier for your database records. Usually your database system will allow you to set a primary key for your database records, which the system will then ensure is unique across the database. Such primary keys are generally enough for many applications. However, there are several other things to consider, such as:

  • Whether identifiers have to be hard to guess, or merely "look random".
  • Whether identifiers are the only thing that grants access to the record.

The best way to generate unique identifiers will depend on these and other questions, which I give in the section "Unique Random Identifiers". You should edit your question post with the answers to the six questions I give in that section; the answers will further suggest what kind of identifiers to use. However, if you can't tolerate the risk of duplicate identifiers, as in this case, then neither random numbers nor hashes of column values are appropriate as unique identifiers unless the application checks them for uniqueness.
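One way to have the application check for uniqueness is to keep the key values themselves in a `HashSet<string>` rather than only their 32-bit hashes. The set still uses a hash for fast lookup, but it compares the full strings whenever hashes match. The following is a minimal sketch, not a definitive implementation. The `SeenKeys` class, the `TryAdd` method and the separator character are hypothetical, and it assumes all key columns are strings, that the separator never occurs in real data, and that ordinal (case-sensitive) comparison matches the database's notion of equality.

```csharp
using System;
using System.Collections.Generic;

public sealed class SeenKeys
{
    // Hypothetical separator: assumed never to appear in a real column value.
    private const string Separator = "\u001F";

    // Ordinal comparison is an assumption; adjust to match the database collation.
    private readonly HashSet<string> _seen = new HashSet<string>(StringComparer.Ordinal);

    // Returns true the first time a composite key is seen, false on repeats.
    public bool TryAdd(params string[] columns)
    {
        // The set stores the joined key itself, so two different keys whose
        // hash codes collide are still told apart by full string equality.
        return _seen.Add(string.Join(Separator, columns));
    }
}
```

A caller would insert only when `TryAdd("G99-76", "F78_XYZ_92323")` returns true. The trade-off is memory, since every key is held in full. The set would also need to be seeded with keys already in the table, and the database's unique constraint remains the final safeguard.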
