
FileChannel, ByteBuffer and Hashing Files


I built a file hashing method in Java that takes a string representation of a filepath + filename as input and then calculates the hash of that file. The hash can be any of the natively supported Java hashing algorithms, such as MD2 through SHA-512.

I am trying to eke out every last drop of performance, since this method is an integral part of a project I'm working on. I was advised to try using FileChannel instead of a regular FileInputStream.

My original method:

    /**
     * Gets Hash of file.
     * 
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HashTypeException If no supported or valid hash algorithm was found.
     */
    public String getHash(String file, String hashAlgo) throws IOException, HashTypeException {
        StringBuffer hexString = null;
        try {
            MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));
            FileInputStream fis = new FileInputStream(file);

            byte[] dataBytes = new byte[1024];

            int nread = 0;
            while ((nread = fis.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
            fis.close();
            byte[] mdbytes = md.digest();

            hexString = new StringBuffer();
            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException | HashTypeException e) {
            throw new HashTypeException("Unsupported Hash Algorithm.", e);
        }
    }

Refactored method:

    /**
     * Gets Hash of file.
     * 
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HasherException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {

        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocate(1024); // allocation in bytes

            int bytes;

            while ((bytes = fc.read(bbf)) != -1) {
                md.update(bbf.array(), 0, bytes);
            }

            fc.close();
            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

Both return a correct hash; however, the refactored method only seems to cooperate on small files. When I pass in a large file, it completely chokes out and I can't figure out why. I'm new to NIO, so please advise.

EDIT: Forgot to mention I'm throwing SHA-512's through it for testing.

UPDATE: Updating with my now current method.

    /**
     * Gets Hash of file.
     * 
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HasherException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {

        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocateDirect(8192); // allocation in bytes - 1024, 2048, 4096, 8192

            int b;

            b = fc.read(bbf);

            while ((b != -1) && (b != 0)) {
                bbf.flip();

                byte[] bytes = new byte[b];
                bbf.get(bytes);

                md.update(bytes, 0, b);

                bbf.clear();
                b = fc.read(bbf);
            }

            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

So I attempted a benchmark, hashing the MD5 of a 2.92GB file using my original example and my latest update's example. Of course any benchmark is relative, since there is OS and disk caching and other "magic" going on that will skew repeated reads of the same files... but here's a shot at some benchmarks. I loaded each method up and fired it off 5 times after compiling it fresh. The benchmark was taken from the last (5th) run, as this would be the "hottest" run for that algorithm and any "magic" (in my theory, anyways).

Here's the benchmarks so far: 

    Original Method - 14.987909 (s) 
    Latest Method - 11.236802 (s)

That is a 25.03% decrease in the time taken to hash the same 2.92GB file. Pretty good.
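
The timing itself was nothing elaborate. A minimal sketch of the idea (the 1 MiB temp file below is an illustrative stand-in for the 2.92GB input, and the digest call stands in for the hashing method under test):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class TimingSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative stand-in for the real input: 1 MiB of zeros.
        Path p = Files.createTempFile("timing", ".bin");
        Files.write(p, new byte[1024 * 1024]);

        double last = 0;
        for (int run = 1; run <= 5; run++) { // keep only the 5th, "hottest" run
            long t0 = System.nanoTime();
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(Files.readAllBytes(p)); // stand-in for the hashing method
            md.digest();
            last = (System.nanoTime() - t0) / 1e9;
        }
        System.out.println("last run: " + last + " (s)");
        Files.delete(p);
    }
}
```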

3 suggestions:

1) clear buffer after each read

int bytes;
while ((bytes = fc.read(bbf)) != -1) {
    md.update(bbf.array(), 0, bytes);
    bbf.clear();
}

2) do not close both fc and fis, it's redundant; closing fis is enough. The FileInputStream.close API says:

If this stream has an associated channel then the channel is closed as well.

3) if you want a performance improvement with FileChannel, use

ByteBuffer.allocateDirect(1024); 

Another possible improvement might come if the code only allocated the temp buffer once. E.g.:

        int bufsize = 8192;
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufsize); 
        byte[] temp = new byte[bufsize];
        int b = channel.read(buffer);

        while (b > 0) {
            buffer.flip();

            buffer.get(temp, 0, b);
            md.update(temp, 0, b);
            buffer.clear();

            b = channel.read(buffer);
        }

Addendum

Note: There is a bug in the string-building code: any byte value below 0x10 is printed as a single hex digit, dropping its leading zero. This can easily be fixed, e.g.:

hexString.append(String.format("%02x", mdbytes[i]));

Also, as an experiment, I rewrote the code to use mapped byte buffers. It runs about 30% faster (6-7 millis vs 9-11 millis, FWIW). I expect you could get more out of it if you wrote hashing code that operated directly on the byte buffer.
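
A minimal sketch of that "operate directly on the byte buffer" idea, assuming a file small enough to map in one piece: MessageDigest.update(ByteBuffer) consumes the mapped region without any intermediate byte[]. The temp file and MD5 choice here are just for illustration:

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;

public class MappedHash {
    public static void main(String[] args) throws Exception {
        // Small test file; a real caller would open an existing one.
        Path p = Files.createTempFile("mapped", ".bin");
        Files.write(p, "hello world".getBytes("US-ASCII"));

        try (FileChannel fc = FileChannel.open(p, StandardOpenOption.READ)) {
            MappedByteBuffer buf = fc.map(MapMode.READ_ONLY, 0, fc.size());
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(buf); // digest reads straight from the mapped region

            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) hex.append(String.format("%02x", b));
            System.out.println(hex);
        } finally {
            Files.delete(p);
        }
    }
}
```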

I attempted to account for JVM initialization and file system caching by hashing a different file with each algorithm before starting the timer. The first run through the code is about 25 times slower than a normal run. This appears to be due to JVM initialization, because all runs in the timing loop are roughly the same length. They do not appear to benefit from caching. I tested with the MD5 algorithm. Also, during the timing portion, only one algorithm is run for the duration of the test program.

The code in the loop is shorter, so potentially more understandable. I'm not 100% certain what kind of pressure memory-mapping many files under high volume would exert on the JVM, so that is something you would need to research and consider if you wanted to run this sort of solution under load.

public static byte[] hash(File file, String hashAlgo) throws IOException {

    FileInputStream inputStream = null;

    try {
        MessageDigest md = MessageDigest.getInstance(hashAlgo);
        inputStream = new FileInputStream(file);
        FileChannel channel = inputStream.getChannel();

        long length = file.length();
        if(length > Integer.MAX_VALUE) {
            // you could make this work with some care,
            // but this code does not bother.
            throw new IOException("File "+file.getAbsolutePath()+" is too large.");
        }

        ByteBuffer buffer = channel.map(MapMode.READ_ONLY, 0, length);

        int bufsize = 1024 * 8;          
        byte[] temp = new byte[bufsize];
        int bytesRead = 0;

        while (bytesRead < length) {
            int numBytes = (int)length - bytesRead >= bufsize ? 
                                         bufsize : 
                                         (int)length - bytesRead;
            buffer.get(temp, 0, numBytes);
            md.update(temp, 0, numBytes);
            bytesRead += numBytes;
        }

        byte[] mdbytes = md.digest();
        return mdbytes;

    } catch (NoSuchAlgorithmException e) {
        throw new IllegalArgumentException("Unsupported Hash Algorithm.", e);
    }
    finally {
        if(inputStream != null) {
            inputStream.close();
        }
    }
}

Here is an example of file hashing using NIO:

  • Path
  • FileChannel
  • MappedByteBuffer

It avoids the use of byte[], so I think this should be an improved version of the above. There is also a second NIO example where the hashed value is stored in a user-defined file attribute. That can be used for HTML ETag generation and other cases where the file does not change.

    public static final byte[] getFileHash(final File src, final String hashAlgo) throws IOException, NoSuchAlgorithmException {
        final int BUFFER = 32 * 1024;
        final Path file = src.toPath();
        try (final FileChannel fc = FileChannel.open(file)) {
            final long size = fc.size();
            final MessageDigest hash = MessageDigest.getInstance(hashAlgo);
            long position = 0;
            while (position < size) {
                // map the next chunk starting at the current position
                final MappedByteBuffer data = fc.map(FileChannel.MapMode.READ_ONLY, position, Math.min(size - position, BUFFER));
                if (!data.isLoaded()) data.load();
                hash.update(data);
                position += data.limit();
            }
            return hash.digest();
        }
    }

    public static final byte[] getCachedFileHash(final File src, final String hashAlgo) throws NoSuchAlgorithmException, FileNotFoundException, IOException {
        final Path path = src.toPath();
        if (!Files.isReadable(path)) return null;
        final UserDefinedFileAttributeView view = Files.getFileAttributeView(path, UserDefinedFileAttributeView.class);
        final String name = "user.hash." + hashAlgo;
        try {
            // read back only the bytes the attribute actually holds
            final ByteBuffer bb = ByteBuffer.allocate(view.size(name));
            view.read(name, bb);
            bb.flip();
            final byte[] cached = new byte[bb.remaining()];
            bb.get(cached);
            return cached;
        } catch (final NoSuchFileException t) { // not yet calculated
        } catch (final Throwable t) {
            t.printStackTrace();
        }
        final byte[] hash = getFileHash(src, hashAlgo);
        view.write(name, ByteBuffer.wrap(hash));
        return hash;
    }
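
The caching above relies on UserDefinedFileAttributeView, which not every filesystem supports (some tmpfs and network mounts, for instance), so it is worth probing first. A small self-contained roundtrip sketch, with an illustrative attribute name and value:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

public class AttrDemo {
    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("attr", ".txt");
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(p, UserDefinedFileAttributeView.class);
        try {
            // Write an attribute, then read it back.
            view.write("user.hash.MD5",
                    ByteBuffer.wrap("cafebabe".getBytes(StandardCharsets.US_ASCII)));
            ByteBuffer bb = ByteBuffer.allocate(view.size("user.hash.MD5"));
            view.read("user.hash.MD5", bb);
            bb.flip();
            System.out.println(StandardCharsets.US_ASCII.decode(bb));
        } catch (Exception unsupported) {
            // Filesystems without user attributes end up here (view may be null).
            System.out.println("user attrs unsupported");
        } finally {
            Files.delete(p);
        }
    }
}
```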
