
What is the fastest way to get unique file hash using Java?

I want to write a program for personal use that walks the file tree of all of my volumes for the purpose of finding duplicate files. I know there are programs out there that do this, but none do it the way I want to do it, and few seem to ever employ file hashing as a check for accuracy. Probably because hashing takes time.

While I walk the file trees, I will be storing three pieces of information in a MySQL database:

  • Full file path
  • File Size
  • Hash Signature

For my purposes, a file will be considered a duplicate if all of these conditions are met:

  • The file name is the same
  • The file size is the same
  • The hash signature is the same.

Given that the first two conditions are true, condition three does NOT need to be incredibly accurate in terms of the hashing algorithm.

Once the tree walks are all done, I will then search the database for matching file hashes and then check the other conditions...

I know that MD5 seems to be the de facto standard for generating file hash signatures, but it is costly in terms of time. In my project I will be generating a hash signature for millions of files, and I don't want to wait several days for the process to finish.
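For reference, the MD5 baseline mentioned above could look roughly like the sketch below. It streams the file through the JDK's MessageDigest in fixed-size chunks so that large files are never loaded into memory at once. The class name FileMd5, the method name md5Hex, and the 64 KiB buffer size are illustrative assumptions, not something from the question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; names and buffer size are assumptions.
final class FileMd5 {

    /** Returns the MD5 digest of the file's contents as a lowercase hex string. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1 << 16]; // 64 KiB read buffer (arbitrary choice)
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            // Feed the digest chunk by chunk until end of file.
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Calling FileMd5.md5Hex(path) on every file reads every byte of every file once. For a scan like this, that disk I/O may well cost more than the hash arithmetic itself, so it is worth measuring before swapping algorithms.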

So based on my requirements, what would be the fastest way to generate a file hash signature in Java that would be good enough to use as a final validation that the two files are indeed duplicates?

Thank you

Update: After some thought and the discussion below, I've decided to slightly alter my method so that I only perform a deeper comparison between files after the first two conditions are met. Meaning I'll walk the tree and create the database entries, then do the deeper computations once the filename and the size are equal, and I'll be exploring a checksum method as opposed to hashing.
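The revised approach in the update can be sketched as follows. This is only an illustration under some assumptions: it keeps the groups in in-memory maps rather than a MySQL table, it uses java.util.zip.CRC32 as the checksum, and the class and method names (DuplicateCandidates, groupByNameAndSize, crc32, findDuplicates) are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.CRC32;

// Hypothetical sketch: in-memory maps stand in for the database described in the question.
final class DuplicateCandidates {

    /** Pass 1: group regular files by "name|size" and keep only groups with 2+ members. */
    static Map<String, List<Path>> groupByNameAndSize(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, List<Path>> groups = new HashMap<>();
        for (Path p : files) {
            String key = p.getFileName() + "|" + Files.size(p);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(p);
        }
        groups.values().removeIf(list -> list.size() < 2);
        return groups;
    }

    /** Streams the file through CRC32 and returns the checksum value. */
    static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                crc.update(buffer, 0, n);
            }
        }
        return crc.getValue();
    }

    /** Pass 2: checksum only the candidates and report sets whose checksums also match. */
    static List<List<Path>> findDuplicates(Path root) throws IOException {
        List<List<Path>> result = new ArrayList<>();
        for (List<Path> candidates : groupByNameAndSize(root).values()) {
            Map<Long, List<Path>> byChecksum = new HashMap<>();
            for (Path p : candidates) {
                byChecksum.computeIfAbsent(crc32(p), k -> new ArrayList<>()).add(p);
            }
            for (List<Path> same : byChecksum.values()) {
                if (same.size() > 1) {
                    result.add(same);
                }
            }
        }
        return result;
    }
}
```

The point of the two passes is that files with a unique name and size are never read at all, so only the candidate groups pay the I/O cost of a checksum. A 32-bit checksum can collide, so a byte-by-byte comparison of the remaining matches would be needed for certainty.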

I have recently been researching a similar problem and ended up with a similar set of conditions. I decided to try MurmurHash3, as it seems purpose-built for this application. It is not cryptographically secure, which is not needed in this scenario, but it seems to be very lightweight.

Apache Commons Codec has an implementation: the MurmurHash3 class in the org.apache.commons.codec.digest package.


 