Fastest and LightWeight Hashing Algorithm for Large Files & 512 KB Chunks [C, Linux, MAC, Windows]

I'm working on a project which involves computing hashes for files. The project is like a file backup service, so when a file gets uploaded from a client to the server, I need to check whether that file is already available on the server. I generate a CRC-32 hash for the file and then send the hash to the server to check if it's already there.

If the file is not on the server, I send it as 512 KB chunks [for dedupe], and I have to calculate a hash for each 512 KB chunk. The files may sometimes be a few GB in size, and multiple clients will connect to the server. So I really need a fast and lightweight hashing algorithm for files. Any ideas?
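For reference, here is a minimal sketch of the kind of per-chunk CRC-32 pass described above, assuming zlib's `crc32()` and a 512 KB read buffer; this is an illustration of the setup, not the actual project code:

```c
/* Sketch: read a file in 512 KB chunks and CRC-32 each chunk with zlib.
 * Illustration only -- assumes zlib; compile with: gcc chunks.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define CHUNK_SIZE (512 * 1024)

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }

    unsigned char *buf = malloc(CHUNK_SIZE);
    if (!buf) { fclose(fp); return 1; }

    size_t n;
    unsigned long chunk_index = 0;
    while ((n = fread(buf, 1, CHUNK_SIZE, fp)) > 0) {
        /* zlib's CRC-32 is seeded with crc32(0L, Z_NULL, 0) */
        uLong crc = crc32(0L, Z_NULL, 0);
        crc = crc32(crc, buf, (uInt)n);
        printf("chunk %lu: %08lx\n", chunk_index++, (unsigned long)crc);
    }

    free(buf);
    fclose(fp);
    return 0;
}
```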

PS: I have already noticed some hashing algorithm questions on StackOverflow, but the answers there don't quite compare the hashing algorithms required for exactly this kind of task. I bet this will be really useful for a bunch of people.

Actually, CRC32 has neither the best speed nor the best distribution.

This is to be expected: CRC32 is pretty old by today's standards, created in an era when CPUs were neither 32/64 bits wide nor out-of-order-executing, and when distribution properties were less important than error detection. All these requirements have changed since.

To evaluate the speed and distribution properties of hash algorithms, Austin Appleby created the excellent SMHasher package. A short summary of results is presented here. I would advise selecting an algorithm with a Q.Score of 10 (perfect distribution).
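As a concrete illustration, here is a sketch using xxHash's one-shot `XXH64()` on a 512 KB buffer. xxHash is one example of a fast, well-distributed hash of the kind SMHasher measures; picking it here is an assumption for the example, not something the answer above prescribes:

```c
/* Sketch: hash a 512 KB chunk with xxHash (XXH64).
 * Assumes the xxHash library is installed; compile with: gcc hash.c -lxxhash */
#include <stdio.h>
#include <xxhash.h>

int main(void)
{
    static unsigned char chunk[512 * 1024];  /* stand-in for a real chunk */
    XXH64_hash_t h = XXH64(chunk, sizeof(chunk), 0 /* seed */);
    printf("xxh64: %016llx\n", (unsigned long long)h);
    return 0;
}
```

A 64-bit result also makes accidental collisions far less likely than CRC-32's 32 bits, though still not impossible.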

You say you are using CRC-32 but want a faster hash. CRC-32 is very basic and pretty fast; I would think the I/O time would be much longer than the hash time. You also want a hash that will not have collisions, that is, where two different files or 512 KB chunks get the same hash value. You could look at any of the cryptographic hashes like MD5 (do not use for secure applications) or SHA1.
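If you go the cryptographic route, a minimal sketch of hashing one chunk with SHA-1 via OpenSSL's legacy one-shot `SHA1()` might look like this. OpenSSL is an assumption here (any SHA-1 implementation works), and note this API is deprecated in OpenSSL 3.0 in favor of the EVP interface:

```c
/* Sketch: SHA-1 of a 512 KB chunk using OpenSSL's one-shot API.
 * Assumes OpenSSL headers/libs; compile with: gcc sha.c -lcrypto */
#include <stdio.h>
#include <openssl/sha.h>

int main(void)
{
    static unsigned char chunk[512 * 1024];   /* stand-in for a real chunk */
    unsigned char digest[SHA_DIGEST_LENGTH];  /* 20 bytes */

    SHA1(chunk, sizeof(chunk), digest);

    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);
    putchar('\n');
    return 0;
}
```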

If you are only using CRC-32 to check whether a file is a duplicate, you are going to get false duplicates, because different files can have the same CRC-32. You had better use SHA-1; CRC-32 and MD5 are both too weak.
