
What is the least expensive hash algorithm?

I don't know much about hash algorithms.

I need to compute the hash of an incoming file live in Java before forwarding the file to a remote system (a bit like S3), which requires a file hash in MD2/MD5/SHA-X. This hash is not computed for security reasons but simply as a consistency checksum.

I am able to compute this hash live while forwarding the file, with a DigestInputStream from the Java standard library, but I would like to know which algorithm is the best to use to avoid performance problems with the DigestInputStream.

One of my former colleagues tested this and told us that computing the hash live can be quite expensive compared to running a Unix command-line tool or hashing a file on disk.
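For reference, the setup described above can be sketched roughly as follows. This is a minimal illustration, not the actual production code: `forwardAndHash` and `toHex` are hypothetical helper names, the 8 KB buffer size is an arbitrary choice, and a `ByteArrayOutputStream` stands in for the remote system.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashWhileForwarding {

    // Hypothetical helper: copies everything from `in` to `out` and returns
    // the digest of the bytes that were copied.
    static byte[] forwardAndHash(InputStream in, OutputStream out, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192]; // assumption: arbitrary buffer size
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            int n;
            // Each bulk read updates the digest with the bytes just read.
            while ((n = dis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return md.digest();
    }

    // Hypothetical helper: lowercase hex encoding of a digest.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = "abc".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream remote = new ByteArrayOutputStream(); // stand-in for the remote system
        byte[] digest = forwardAndHash(new ByteArrayInputStream(payload), remote, "MD5");
        System.out.println(toHex(digest)); // 900150983cd24fb0d6963f7d28e17f72 (RFC 1321 test vector)
    }
}
```

The file is read only once: the digest is updated as a side effect of the same reads that feed the forwarding stream.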


Edit about premature optimization: I work at a company that helps other companies dematerialize their documents. This means we run a batch that handles document transfers from other companies. We are targeting millions of documents per day in the future, and the execution time of this batch is currently sensitive for our business.

A hashing optimisation of 10 milliseconds per document, at 1 million documents per day, reduces the daily execution time by nearly 3 hours (10,000 seconds), which is pretty huge.

If you simply want to detect accidental corruption during transmission, etc., then a simple (non-cryptographic) checksum should be sufficient. But note that, for example, a 16-bit checksum will fail to detect random corruption one time in 2^16. And it is no guard against someone deliberately modifying the data.

The Wikipedia page on checksums lists various options, including a number of commonly used (and cheap) ones like Adler-32 and CRCs.
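As a rough sketch of what that could look like in Java, the standard library ships `java.util.zip.CRC32` and `java.util.zip.Adler32`, and `CheckedInputStream` plays the same role for them that `DigestInputStream` plays for `MessageDigest`. The class name `StreamChecksum` and the helper `checksumOf` are made up for this illustration, and this only helps if the receiving system accepts such a checksum instead of MD2/MD5/SHA-X.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.Checksum;

class StreamChecksum {

    // Hypothetical helper: drains the stream and returns the checksum of its content.
    static long checksumOf(InputStream in, Checksum checksum) throws IOException {
        byte[] buffer = new byte[8192]; // assumption: arbitrary buffer size
        try (CheckedInputStream cis = new CheckedInputStream(in, checksum)) {
            while (cis.read(buffer) != -1) {
                // nothing to do: reading updates the checksum
            }
        }
        return checksum.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        long crc = checksumOf(new ByteArrayInputStream(data), new CRC32());
        long adler = checksumOf(new ByteArrayInputStream(data), new Adler32());
        System.out.println("CRC-32:   " + Long.toHexString(crc)); // cbf43926, the standard CRC-32 check value
        System.out.println("Adler-32: " + Long.toHexString(adler));
    }
}
```

Both are 32-bit checksums, so by the same reasoning as above they miss random corruption about one time in 2^32, and they offer no protection against deliberate modification.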

However, I agree with @ppeterka. This smells of "premature optimization".

I know that a lot of people do not believe in micro-benchmarks, but let me post the results I got.

Input:

bigFile.txt = approx. 143 MB in size

hashAlgorithm = MD2, MD5, SHA-1

test code:

        // Runs forever; each iteration hashes the whole file and prints the elapsed time.
        while (true) {
            long start = System.currentTimeMillis();
            MessageDigest md = MessageDigest.getInstance(hashAlgorithm);
            try (InputStream is = new BufferedInputStream(Files.newInputStream(Paths.get("bigFile.txt")))) {
                DigestInputStream dis = new DigestInputStream(is, md);
                int b;
                // Reads one byte per call, so the digest is updated once per byte.
                while ((b = dis.read()) != -1) {
                }
            }
            byte[] digest = md.digest();
            System.out.println(System.currentTimeMillis() - start);
        }

results (milliseconds per run):

MD5
------
22030
10356
9434
9310
11332
9976
9575
16076
-----

SHA-1
-----
18379
10139
10049
10071
10894
10635
11346
10342
10117
9930
-----

MD2
-----
45290
34232
34601
34319
-----

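It seems that MD2 is much slower than MD5 or SHA-1 (roughly 34 seconds per run versus about 10 seconds), while MD5 and SHA-1 are close to each other.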
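One thing worth keeping in mind about the test code above: it calls `read()` once per byte, and `DigestInputStream` updates the digest on every read call, so part of the measured time may be per-call overhead rather than the hash itself. The sketch below only shows that a byte-by-byte read and a buffered read produce the same digest; it does not measure anything. The class and method names are made up for the illustration, and the 64 KB buffer is an arbitrary choice.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Random;

class BulkVsSingleByteRead {

    // Same access pattern as the benchmark above: one read() call per byte.
    static byte[] digestByteByByte(byte[] data, String algorithm) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = new DigestInputStream(new ByteArrayInputStream(data), md)) {
            while (in.read() != -1) {
                // each call feeds exactly one byte to the digest
            }
        }
        return md.digest();
    }

    // Alternative: read into a buffer, so the digest is updated once per chunk.
    static byte[] digestBuffered(byte[] data, String algorithm, int bufferSize) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = new DigestInputStream(new ByteArrayInputStream(data), md)) {
            while (in.read(buffer) != -1) {
                // each call feeds up to bufferSize bytes to the digest
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[1_000_000];
        new Random(42).nextBytes(data); // arbitrary test data
        for (String algorithm : new String[] {"MD5", "SHA-1"}) {
            boolean same = MessageDigest.isEqual(
                    digestByteByByte(data, algorithm),
                    digestBuffered(data, algorithm, 64 * 1024));
            System.out.println(algorithm + " digests match: " + same);
        }
    }
}
```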

Like NKukhar, I tried a micro-benchmark, but with different code and better results:

// Assumes Apache Commons IO (IOUtils), Apache Commons Codec (MessageDigestAlgorithms)
// and Guava (ImmutableSet) are on the classpath.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Set;
import com.google.common.collect.ImmutableSet;
import org.apache.commons.codec.digest.MessageDigestAlgorithms;
import org.apache.commons.io.IOUtils;

public static void main(String[] args) throws Exception {
    String bigFile = "100mbfile";

    // We put the file bytes in memory; we don't want to measure the time it takes to read from the disk
    byte[] bigArray = IOUtils.toByteArray(Files.newInputStream(Paths.get(bigFile)));
    byte[] buffer = new byte[50_000]; // the byte buffer we will use to consume the stream

    // we prepare the algos to test
    Set<String> algos = ImmutableSet.of(
            "no_hash", // no hashing
            MessageDigestAlgorithms.MD5,
            MessageDigestAlgorithms.SHA_1,
            MessageDigestAlgorithms.SHA_256,
            MessageDigestAlgorithms.SHA_384,
            MessageDigestAlgorithms.SHA_512
    );

    int executionNumber = 20;

    for (String algo : algos) {
        long totalExecutionDuration = 0;
        for (int i = 0; i < executionNumber; i++) {
            long beforeTime = System.currentTimeMillis();
            InputStream is = new ByteArrayInputStream(bigArray);
            if (!"no_hash".equals(algo)) {
                is = new DigestInputStream(is, MessageDigest.getInstance(algo));
            }
            // consume the whole stream in 50,000-byte chunks
            while (is.read(buffer) != -1) { }
            long executionDuration = System.currentTimeMillis() - beforeTime;
            totalExecutionDuration += executionDuration;
        }
        System.out.println(algo + " -> average of " + totalExecutionDuration / executionNumber + " millies per execution");
    }
}

This produces the following output for a 100 MB file on a good i7 developer machine:

no_hash -> average of 6 millies per execution
MD5 -> average of 201 millies per execution
SHA-1 -> average of 335 millies per execution
SHA-256 -> average of 576 millies per execution
SHA-384 -> average of 481 millies per execution
SHA-512 -> average of 464 millies per execution
