What is the fastest way to get a unique file hash using Java?
I want to write a program for personal use that walks the file tree of all of my volumes to find duplicate files. I know there are programs out there that do this, but none work the way I want, and few seem to use file hashing as an accuracy check. Probably because hashing takes time.
While I walk the file trees, I will be storing three pieces of information in a MySQL database, which will be:
Because for my purposes, a file will be considered a duplicate if all of these conditions are met:
Given that the first two conditions are true, condition three does NOT need an incredibly accurate hashing algorithm.
Once the tree walks are all done, I will search the database for matching file hashes and then check the other conditions.
I know that MD5 seems to be the 'de facto standard' for generating unique file hash signatures, but it is costly in terms of time, and in my project I will be generating hash signatures for millions of files and don't want to wait several days for the process to finish.
So based on my requirements, what would be the fastest way to generate a file hash signature in Java that would be good enough to use as a final validation that two files are indeed duplicates?
Thank you
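For context, here is what the conventional MessageDigest-based approach looks like in plain JDK code, streaming the file rather than loading it whole; this is a sketch, and the helper name and buffer size are my own choices:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class FileHasher {
    // Stream the file through the digest in 8 KB chunks so large files
    // are never loaded into memory at once. The algorithm name is any
    // value MessageDigest.getInstance accepts, e.g. "MD5" or "SHA-256".
    public static String hashFile(Path file, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        // Render the digest bytes as lowercase hex for storage in the database.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

On most machines the disk read, not the digest computation, dominates the run time, which is worth measuring before swapping algorithms.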
Update: After some thought and the discussion below, I've decided to slightly alter my method so that I only perform a deeper comparison between files after the first two conditions are met. That is, I'll walk the tree and create the database entries, then do the deeper computation once the filename and the size are equal, and I'll be exploring a checksum method as opposed to hashing.
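A checksum along those lines could use the JDK's built-in `java.util.zip.CRC32`, which is typically much cheaper than a cryptographic digest; a minimal sketch (the method name is my own):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class FileChecksum {
    // CRC32 is not collision-resistant, but once filename and size already
    // match it serves as a cheap third check; a byte-by-byte comparison can
    // follow for the rare files that still collide.
    public static long crc32Of(Path file) throws Exception {
        CRC32 crc = new CRC32();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }
}
```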
I have recently been researching a similar problem and ended up with a similar set of conditions. I decided to try MurmurHash3, as it seems purpose-built for this application. It is not cryptographically secure, which is not needed in this scenario, but it is very lightweight.
Apache has an implementation in their commons-codec package.
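In commons-codec the implementation lives in `org.apache.commons.codec.digest.MurmurHash3` (with methods such as `hash32x86` and `hash128x64`), which is what you should use in practice. To show why the algorithm is so light-weight, here is a self-contained sketch of the 32-bit x86 variant, written from the published algorithm:

```java
public class Murmur3Sketch {
    // MurmurHash3 x86_32: a few multiplies, rotates, and XORs per 4-byte
    // block, with no table lookups or expensive rounds - hence its speed.
    public static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51, c2 = 0x1b873593;
        int h = seed, len = data.length, i = 0;
        // Body: process the input four bytes (little-endian) at a time.
        for (; i + 4 <= len; i += 4) {
            int k = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                  | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
            k *= c1; k = Integer.rotateLeft(k, 15); k *= c2;
            h ^= k; h = Integer.rotateLeft(h, 13); h = h * 5 + 0xe6546b64;
        }
        // Tail: fold in the 1-3 leftover bytes, if any.
        int k = 0;
        switch (len & 3) {
            case 3: k ^= (data[i + 2] & 0xff) << 16;
            case 2: k ^= (data[i + 1] & 0xff) << 8;
            case 1: k ^= (data[i] & 0xff);
                    k *= c1; k = Integer.rotateLeft(k, 15); k *= c2; h ^= k;
        }
        // Finalization: mix the length in and avalanche the bits.
        h ^= len;
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

For millions of files, the 128-bit variant (`hash128x64` in commons-codec) keeps accidental-collision odds negligible while remaining far cheaper than MD5.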