What is the least expensive hash algorithm?

Question

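I know very little about hash algorithms.

I need to compute the hash of an incoming file in Java before forwarding it to a remote system (somewhat like S3), which requires a file hash in MD2 / MD5 / SHA-X. The hash is not computed for security reasons; it only serves as a consistency checksum.

I can compute this hash on the fly while forwarding the file, using DigestInputStream from the Java standard library, but I would like to know which algorithm is the best choice to avoid performance problems with DigestInputStream.

A former colleague of mine ran tests and told us that computing the hash on the fly can be quite expensive compared to using a Unix command-line tool or hashing a file on disk.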

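Edit regarding premature optimization: I work for a company whose aim is to help other companies dematerialize (digitize) their documents. This means we run a batch that handles file transfers from other companies. We are targeting millions of documents per day in the future, and the execution time of this batch is critical for our business.

Saving 10 milliseconds of hashing per document on 1 million documents per day (10 ms × 1,000,000 = 10,000 s, about 2.8 hours) cuts the daily execution time by roughly 3 hours, which is huge.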

Answer 1

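If you only want to detect accidental corruption during transmission and the like, then a simple (non-cryptographic) checksum should be sufficient. Note, however, that (for example) a 16-bit checksum will fail to detect a random corruption one time in 2^16. It also offers no protection against someone deliberately modifying the data.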

The Wikipedia page on checksums lists various options, including common (and cheap) ones such as Adler-32 and CRC.
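As an illustration of the cheap options mentioned above, here is a minimal sketch that computes a CRC-32 while copying a stream, using java.util.zip.CheckedInputStream from the standard library. The class name ChecksumForwarder and the method forwardWithCrc32 are hypothetical names chosen for this example, and it only applies if the receiving system accepts such a checksum instead of MD2 / MD5 / SHA-X.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper class; the names are illustrative only.
class ChecksumForwarder {

    // Copies everything from 'in' to 'out' and returns the CRC-32 of the bytes copied.
    static long forwardWithCrc32(InputStream in, OutputStream out) throws IOException {
        CheckedInputStream checked = new CheckedInputStream(in, new CRC32());
        byte[] buffer = new byte[8192];
        int n;
        while ((n = checked.read(buffer)) != -1) {
            out.write(buffer, 0, n); // the checksum is updated as a side effect of read()
        }
        return checked.getChecksum().getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream forwarded = new ByteArrayOutputStream();
        long crc = forwardWithCrc32(new ByteArrayInputStream(data), forwarded);
        System.out.println(Long.toHexString(crc)); // prints cbf43926
    }
}
```

Passing new java.util.zip.Adler32() instead of new CRC32() would switch the sketch to Adler-32 without other changes, since both implement java.util.zip.Checksum.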

That said, I agree with @ppeterka. This smells of "premature optimization".

Answer 2

I know many people don't trust microbenchmarks, but let me post the results I got.

Input:

bigFile.txt = approx. 143 MB in size

hashAlgorithm = MD2, MD5, SHA-1

Test code:

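        // Needs java.io.*, java.nio.file.* and java.security.*; runs inside a method that throws Exception.
        while (true) {
            long start = System.currentTimeMillis();
            MessageDigest md = MessageDigest.getInstance(hashAlgorithm);
            try (InputStream is = new BufferedInputStream(Files.newInputStream(Paths.get("bigFile.txt")));
                 DigestInputStream dis = new DigestInputStream(is, md)) {
                // Note: single-byte reads update the digest one byte per call,
                // so the timings include that per-call overhead.
                while (dis.read() != -1) {
                    // drain the stream; the digest is updated as a side effect
                }
            }
            byte[] digest = md.digest(); // finalize the hash so its cost is included
            System.out.println(System.currentTimeMillis() - start); // elapsed milliseconds
        }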

Results (milliseconds per run):

MD5
------
22030
10356
9434
9310
11332
9976
9575
16076
-----

SHA-1
-----
18379
10139
10049
10071
10894
10635
11346
10342
10117
9930
-----

MD2
-----
45290
34232
34601
34319
-----

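MD2 appears to be noticeably slower than MD5 or SHA-1 (roughly 34 s versus roughly 10 s per run here), while MD5 and SHA-1 perform about the same in this test.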
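One caveat about the test above: it calls read() one byte at a time, so every byte triggers its own digest update, and part of the measured time is likely call overhead rather than hashing cost. The following is a minimal sketch for checking this on your own machine; the class ReadStyleComparison, its method names, the 16 MB input and the 8192-byte buffer are all assumptions made for this example, and no timings are claimed here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical comparison harness; names and sizes are illustrative only.
class ReadStyleComparison {

    // Hashes 'data' through a DigestInputStream, reading one byte per call.
    static byte[] digestByteByByte(byte[] data, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = new DigestInputStream(new ByteArrayInputStream(data), md)) {
            while (in.read() != -1) {
                // drain the stream
            }
        }
        return md.digest();
    }

    // Hashes 'data' through a DigestInputStream, reading 'bufferSize' bytes per call.
    static byte[] digestWithBuffer(byte[] data, String algorithm, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[bufferSize];
        try (InputStream in = new DigestInputStream(new ByteArrayInputStream(data), md)) {
            while (in.read(buffer) != -1) {
                // drain the stream
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[16 * 1024 * 1024]; // 16 MB of zeros, an arbitrary test size
        for (String algorithm : new String[] {"MD5", "SHA-1"}) {
            long t0 = System.nanoTime();
            byte[] a = digestByteByByte(data, algorithm);
            long t1 = System.nanoTime();
            byte[] b = digestWithBuffer(data, algorithm, 8192);
            long t2 = System.nanoTime();
            System.out.println(algorithm + ": byte-by-byte " + (t1 - t0) / 1_000_000
                    + " ms, buffered " + (t2 - t1) / 1_000_000
                    + " ms, same digest: " + Arrays.equals(a, b));
        }
    }
}
```

Both methods produce the same digest for the same input, so any difference in the printed times comes from how the stream is read, not from the algorithm.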

Answer 3

Like NKukhar, I tried a microbenchmark, but with different code and better results:

// Requires Apache Commons IO (IOUtils), Apache Commons Codec (MessageDigestAlgorithms)
// and Guava (ImmutableSet), plus java.io.*, java.nio.file.*, java.security.* and java.util.Set.
public static void main(String[] args) throws Exception {
    String bigFile = "100mbfile";

    // Load the file into memory first so that disk read time is not measured
    byte[] bigArray;
    try (InputStream fileIn = Files.newInputStream(Paths.get(bigFile))) {
        bigArray = IOUtils.toByteArray(fileIn);
    }
    byte[] buffer = new byte[50_000]; // buffer used to consume the stream

    // Algorithms to test
    Set<String> algos = ImmutableSet.of(
            "no_hash", // baseline: no hashing
            MessageDigestAlgorithms.MD5,
            MessageDigestAlgorithms.SHA_1,
            MessageDigestAlgorithms.SHA_256,
            MessageDigestAlgorithms.SHA_384,
            MessageDigestAlgorithms.SHA_512
    );

    int executionNumber = 20;

    for (String algo : algos) {
        long totalExecutionDuration = 0;
        for (int i = 0; i < executionNumber; i++) {
            long beforeTime = System.currentTimeMillis();
            InputStream is = new ByteArrayInputStream(bigArray);
            if (!"no_hash".equals(algo)) {
                is = new DigestInputStream(is, MessageDigest.getInstance(algo));
            }
            while (is.read(buffer) != -1) {
                // drain the stream; the digest is updated as a side effect
            }
            long executionDuration = System.currentTimeMillis() - beforeTime;
            totalExecutionDuration += executionDuration;
        }
        System.out.println(algo + " -> average of " + totalExecutionDuration / executionNumber + " millies per execution");
    }
}

On a good i7 developer machine, this produces the following output for a 100 MB file:

no_hash -> average of 6 millies per execution
MD5 -> average of 201 millies per execution
SHA-1 -> average of 335 millies per execution
SHA-256 -> average of 576 millies per execution
SHA-384 -> average of 481 millies per execution
SHA-512 -> average of 464 millies per execution
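For the forwarding use case in the question, the same bulk-read pattern can feed an output stream while the digest is computed. This is only a sketch under stated assumptions: the class HashingForwarder and the method forwardWithDigest are hypothetical names, the OutputStream stands in for whatever client the remote system provides, and the 50,000-byte buffer simply mirrors the benchmark above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the names are illustrative only.
class HashingForwarder {

    // Copies 'in' to 'out' in bulk and returns the hex-encoded digest of the bytes copied.
    static String forwardWithDigest(InputStream in, OutputStream out, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        DigestInputStream dis = new DigestInputStream(in, md);
        byte[] buffer = new byte[50_000]; // assumed size, same as the benchmark above
        int n;
        while ((n = dis.read(buffer)) != -1) {
            out.write(buffer, 0, n); // the digest is updated as a side effect of read()
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream forwarded = new ByteArrayOutputStream();
        String md5 = forwardWithDigest(new ByteArrayInputStream(data), forwarded, "MD5");
        System.out.println(md5); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

The algorithm name is a parameter, so switching between MD5 and SHA-1 is a one-argument change once the benchmark numbers have been weighed against what the remote system accepts.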

Answer 1
5 (accepted) 2013-10-03 11:10:47

Answer 2
1 2013-10-03 11:41:54

Answer 3
0 2013-10-03 13:28:31
