
Best algorithm for hashing number values?

When dealing with a series of numbers and wanting to use hash results for security reasons, what is the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers or bank account numbers. The preferred output is a single unsigned integer, to make matching easier.

My feeling is that most string hash implementations appear to have low entropy when run against such a small range of characters and, because of that, the collision rate might be higher than when run against a larger sample.

The target language is Delphi; however, answers in other languages are welcome if they provide a mathematical basis that can lead to an optimal solution.

The purpose of this routine is to determine whether a received card/account number has been processed before. The input file could have many records to check against a database of many records, so performance is a factor.

With security questions, all the answers lie on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure and one that is very convenient. Given those and the explanation of each, you can choose the best solution for your system.

You stated that your objective was to store this value in lieu of the actual credit card number so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Including the CCV, expiration date, name, etc. would render it useless, since the value could then differ for the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value, which remains uniform for all entries.

The convenient solution is to use FNV (as Zebrabox and Nick suggested). This produces a 32-bit number that indexes quickly for searches. The downside, of course, is that it allows at most about 4 billion different values, and in practice it will produce collisions much sooner than that (by the birthday bound, a collision becomes more likely than not after roughly 77,000 entries). Because it has such a high collision rate, a brute force attack will probably generate enough invalid results to be of little use.
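As a rough illustration of the convenient option, here is a minimal sketch of a 32-bit FNV hash in C. It assumes the FNV-1a variant (XOR first, then multiply), which the answer does not specify, and the function name `fnv1a_32` is made up for this example. The card number, plus any salt, is passed in as a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a NUL-terminated string (variant chosen for this sketch). */
uint32_t fnv1a_32(const char *s)
{
    assert(s != NULL);

    uint32_t hash = 2166136261u;      /* FNV 32-bit offset basis */
    for (; *s != '\0'; ++s) {
        hash ^= (uint8_t)*s;          /* mix in the next byte */
        hash *= 16777619u;            /* FNV 32-bit prime; wraps modulo 2^32 */
    }
    return hash;
}
```

Calling `fnv1a_32("4111111111111111")` would yield the 32-bit value to store in an indexed integer column. Because distinct numbers can share a value, a match should be treated as a candidate rather than as proof that the card was seen before.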

The secure solution is to rely on a SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes, I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack, speed is the enemy. If you want to be secure, then you want it to be SLOW. SHA is designed so that finding a collision is computationally infeasible for input of any size. If a collision is found, the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.

Now, if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K), storing the full hash in one column, and storing its first 32 bits in a second column, with the index on the second column. Perform your look-up on the 32-bit value first. If that produces no matches, then you have no matches. If it does produce a match, then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison on a much smaller set. (Hashes are actually binary; they are only represented as strings for easy human reading and for transfer in text-based protocols.)
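The two-column idea can be sketched in C as follows. This is only an illustration under stated assumptions: the 32-byte digest is taken to come from whatever SHA-256 implementation you already use (none is included here), the "first 32 bits" are read in big-endian order, and the names `DIGEST_LEN`, `digest_prefix32`, and `digests_equal` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* SHA-256 digest size in bytes (assumed hash choice) */

/* Value for the indexed column: the first 32 bits of the digest, big-endian. */
uint32_t digest_prefix32(const uint8_t digest[DIGEST_LEN])
{
    assert(digest != NULL);
    return ((uint32_t)digest[0] << 24) |
           ((uint32_t)digest[1] << 16) |
           ((uint32_t)digest[2] << 8)  |
            (uint32_t)digest[3];
}

/* Full binary comparison, run only on rows whose 32-bit prefix matched. */
int digests_equal(const uint8_t a[DIGEST_LEN], const uint8_t b[DIGEST_LEN])
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, DIGEST_LEN) == 0;
}
```

A look-up would first select rows by `digest_prefix32` and then call `digests_equal` on each candidate, so the 32-byte comparison runs only on the handful of rows that share a prefix.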

If you are really concerned about speed, then you can reduce the number of iterations. Frankly, it will still be fast even with 1000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and on other factors (communication speed, hardware response, load, etc.) that may affect the duration. You may find that you're optimizing the fastest point in the process, which will have little to no actual impact.

Also, I would recommend that you benchmark the look-up on the full hash vs. the 32-bit subset. Most modern database systems are fairly fast, contain a number of optimizations, and frequently optimize for us doing things the easy way. When we try to get smart, we sometimes just slow things down. What is that quote about premature optimization...?

This seems to be a case for key derivation functions. Have a look at PBKDF2.

Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked using brute force, because these hash algorithms are usually designed to be as fast as possible.

UPDATE

Okay, security is not a concern for your task. Because you already have a numerical input, you could just use this (account) number modulo your hash table size. If you process it as a string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.
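A minimal C sketch of the modulo approach is shown below. It assumes the number arrives as a string of up to 19 decimal digits (so it fits in a `uint64_t`); the helper names `digits_to_u64` and `bucket_index` are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parse a non-empty, digits-only string into a 64-bit integer.
   Returns false on an empty string, a non-digit, or overflow. */
bool digits_to_u64(const char *s, uint64_t *out)
{
    assert(s != NULL && out != NULL);

    if (*s == '\0')
        return false;

    uint64_t n = 0;
    for (; *s != '\0'; ++s) {
        if (*s < '0' || *s > '9')
            return false;
        uint64_t d = (uint64_t)(*s - '0');
        if (n > (UINT64_MAX - d) / 10)   /* n * 10 + d would overflow */
            return false;
        n = n * 10 + d;
    }
    *out = n;
    return true;
}

/* Map the number to a slot in a hash table with table_size buckets. */
size_t bucket_index(uint64_t number, size_t table_size)
{
    assert(table_size > 0);
    return (size_t)(number % table_size);
}
```

For example, `bucket_index(1234567890, 1000)` selects bucket 890. How evenly the buckets fill depends on how the account numbers are assigned.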

Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest trying a highly non-linear hash function to spread these clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5. Just split the 128-bit hash into four groups of 32 bits, combine them using XOR, and interpret the result as a 32-bit integer.
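The XOR folding step can be sketched in C like this. The 16-byte MD5 digest is assumed to come from an existing MD5 implementation (none is included here), each group is read in big-endian order as an arbitrary choice, and the name `fold128_to_32` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 128-bit digest into 32 bits by XOR-ing its four 32-bit groups. */
uint32_t fold128_to_32(const uint8_t digest[16])
{
    assert(digest != NULL);

    uint32_t folded = 0;
    for (size_t i = 0; i < 16; i += 4) {
        /* Assemble one 32-bit group from four bytes, big-endian. */
        uint32_t word = ((uint32_t)digest[i]     << 24) |
                        ((uint32_t)digest[i + 1] << 16) |
                        ((uint32_t)digest[i + 2] << 8)  |
                         (uint32_t)digest[i + 3];
        folded ^= word;
    }
    return folded;
}
```

Since XOR is applied to whole groups, every bit of the digest contributes to exactly one bit of the result.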

While not directly related, you may also have a look at Benford's law; it provides some insight into why numbers are usually not evenly distributed.

If security is needed, use a cryptographically secure hash such as SHA-256.

If performance is a factor, I suggest taking a look at a CodeCentral entry by Peter Below. It performs very well for large numbers of items.

By default it uses the P. J. Weinberger ELF hashing function, but others are also provided.

I needed to look deeply into hash functions a few months ago. Here are some things I found.

You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but it could be 16 or 64 bits). You want every character of the input to have an equally large effect on the output.

ALL the simple hashes (like ELF or PJW) that simply loop through the string and XOR in each byte with a shift or a mod will fail that criterion for a simple reason: the last characters added have the most effect.

But there are some really good algorithms available in Delphi and asm. Here are some references:

See the 1997 Dr. Dobb's article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c

SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html

You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his hashtable. I suggested the old but nicely performing elf-hash, but also noted a much better hash function I recently found. It was called SuperFastHash (SFH) and was created by Paul Hsieh to overcome his 'problems' with the hash functions from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm. A few people worked on a basm implementation and posted it."

The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies C++ code MurmurHash2.cpp and an aligned-read-only implementation,
MurmurHashAligned2.cpp
//========================================================================
// Here is Landman's MurmurHash2 in C#
//2009-02-25 Davy Landman does C# implementations of SuperFastHash and MurmurHash2
//landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
//
//Landman implements both SuperFastHash and MurmurHash2 4 ways in C#:
//1: Managed Code 2: Inline Bit Converter 3: Int Hack 4: Unsafe Pointers
//SuperFastHash 1: 281 2: 780 3: 1204 4: 1308 MB/s
//MurmurHash2 1: 486 2: 759 3: 1430 4: 2196

Sorry if the above turns out to look like a mess. I had to just cut and paste it.

At least one of the references above gives you the option of getting a 64-bit hash, which makes collisions extremely unlikely across the space of credit card numbers, and which could easily be stored in a bigint field in MySQL.

You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.

By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.

So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.

For a non-cryptographic approach, you could take a look at the FNV hash; it's fast, with a low collision rate.

As a very fast alternative, I've also used this algorithm for a few years and had few collision issues. I can't give you a mathematical analysis of its inherent soundness, but for what it's worth, here it is:

=Edit - My code sample was incorrect - now fixed =

In C/C++:

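unsigned int Hash(const char *s)
{
    /* Unsigned accumulator: overflow wraps instead of being undefined. */
    unsigned int hash = 0;

    while (*s != 0)
    {
        hash *= 37;
        /* Cast so bytes above 127 are not sign-extended. */
        hash += (unsigned char)*s;
        s++;
    }

    return hash;
}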

Note that 37 is a magic number, chosen because it is prime.

The best hash function for the natural numbers: let

 f(n)=n

No conflicts ;)
