High-performance hashing without collisions?

The Hashing function below was borrowed heavily from this post, but it produces too many collisions in my application.

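public static class Hashing
{
    // 32-bit FNV offset basis (0x811C9DC5) and FNV prime.
    private const int FNV1a_offsetBias = unchecked( ( int )0x81_1c_9d_c5 );
    private const int FNV1a_prime = 16_777_619;

    // Combines the GetHashCode() of each value into a single 32-bit hash.
    public static int FNV1a(params dynamic[] values) {
        var hash = FNV1a_offsetBias;

        foreach ( var value in values )
            hash = FNV1a_Crank(hash, value.GetHashCode());

        return hash;
    }

    // One combining step: multiply by the prime, then add the next hash code.
    // (Textbook FNV-1a instead XORs each input byte in and then multiplies.)
    private static int FNV1a_Crank(int start, int addendum) {
        unchecked {
            start *= FNV1a_prime;
            start += addendum;
        }

        return start;
    }
}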

I need high-performance hashing that is guaranteed unique. I realize it will likely be slower than the function above, but I'm hoping to find something that is not dramatically slower. The SE post linked above is fascinating and useful, but it also leaves me unsure what to use.

The use case for my hashing is this: I have an app that inserts millions of records into my database every day. The target tables have unique keys, so any insert that violates uniqueness throws an exception. I cannot allow these exceptions to be thrown because handling them is far too slow, and they are better avoided for other reasons. So I use the function above to hash the column values in the composite unique key of each insert and store the result in a hash table. Before each insert, I generate a hash and look for it in the hash table. If it's not there, I'm safe to do the insert. If it is there, the record already exists, and I skip the insert.

It's very fast, and at first I thought it worked. But then I found dozens of cases (out of millions) in which hashes collide, so my app believes a record has already been inserted when in fact it hasn't. The result is missing records, which is unacceptable to the business.

Here are a few examples of the sort of data I am hashing:

Hasher("Z125",  "99-8ZG10", "SpecialZ_S07181_2");
Hasher("G125");
Hasher("G99-76", "F78_XYZ_92323");

So I'm looking for a C# function that provides the fastest possible hashing algorithm that is guaranteed unique. In other words, I need a performant way to check, millions of times, whether a record already exists in the table. Hashing seems like the fastest way, but uniqueness is paramount.

Any ideas?

It appears your goal is to generate a unique identifier for your database records. Usually your database system will let you set a primary key for your records, and the system will then ensure that key is unique across the table. Such primary keys are enough for many applications. However, there are several other things to consider, such as:

  • Whether identifiers have to be hard to guess, or merely "look random".
  • Whether identifiers are the only thing that grants access to the record.

The best way to generate unique identifiers will depend on these and other questions, which I give in the section "Unique Random Identifiers". You should edit your question with the answers to the six questions I give in that section; the answers will further suggest what kind of identifiers to use. However, if you can't tolerate the risk of duplicate identifiers, as in this case, then neither random numbers nor hashes of column values are appropriate as unique identifiers unless the application checks them for uniqueness.
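One way to have the application check for uniqueness, as described above, is to keep the composite key itself in a set rather than only its 32-bit hash. The following is a minimal sketch, not a definitive implementation. The class SeenKeys and its methods Encode and TryMarkInserted are hypothetical names introduced here for illustration. It assumes all key columns are non-null strings, that the keys fit in memory, and that the database compares keys case-sensitively (otherwise the comparer would need to match the database's collation).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not from the original post): remembers every composite
// key exactly, so two different keys are never mistaken for each other.
public sealed class SeenKeys
{
    // HashSet<string> hashes internally for speed, but confirms matches with
    // a full string comparison, so hash collisions cannot cause false hits.
    private readonly HashSet<string> _seen =
        new HashSet<string>(StringComparer.Ordinal);

    // Builds one unambiguous string from the key columns by prefixing each
    // column with its length, so ("ab", "c") and ("a", "bc") encode differently.
    // Assumption: no column is null.
    public static string Encode(params string[] columns)
    {
        var sb = new StringBuilder();
        foreach (var column in columns)
        {
            sb.Append(column.Length).Append(':').Append(column);
        }
        return sb.ToString();
    }

    // Returns true if this key has not been seen before (safe to insert),
    // false if an identical key was already recorded (skip the insert).
    public bool TryMarkInserted(params string[] columns)
    {
        return _seen.Add(Encode(columns));
    }
}
```

With this sketch, calling TryMarkInserted("Z125", "99-8ZG10", "SpecialZ_S07181_2") returns true the first time and false on any later call with the same columns. The trade-off is memory: the set stores every full key instead of a 4-byte hash, which may or may not be acceptable depending on how many distinct keys the table holds.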
