Is there any way I can speed up the opening and hashing of 15,000 small files in C#?

I'm working on SHA1 checksum hashing 15,000 images (40KB - 1.0MB each, approximately 1.8GB total). I'd like to speed this up as it is going to be a key operation in my program and right now it is taking between 500-600 seconds.

I've tried the following, which took 500 seconds:

public string GetChecksum(string filePath)
{
    // dispose the stream as soon as hashing completes
    using (FileStream fs = new FileStream(filePath, FileMode.Open))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(fs));
    }
}

Then I thought maybe the chunks SHA1Managed() was reading in were too small, so I used a BufferedStream and increased the buffer size to larger than any of the files I'm reading in.

public string GetChecksum(string filePath)
{
    using (var bs = new BufferedStream(File.OpenRead(filePath), 1200000))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(bs));
    }
}

This actually took 600 seconds.

Is there anything I can do to speed up these IO operations, or am I stuck with what I got?


As per x0n's suggestion, I tried just reading each file into a byte array and discarding the result. It appears I'm IO bound, as this took ~480 seconds in itself.
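For reference, a raw-read baseline like that could be timed with a minimal sketch along these lines (the method name and Stopwatch harness are illustrative, not from the original post):

// assumes: using System; using System.Diagnostics; using System.IO;
public static TimeSpan MeasureRawReadTime(string[] filePaths)
{
    // Read every file once and discard the bytes, so the elapsed
    // time reflects pure disk IO with no hashing cost.
    var sw = Stopwatch.StartNew();
    foreach (string filePath in filePaths)
    {
        File.ReadAllBytes(filePath);
    }
    sw.Stop();
    return sw.Elapsed;
}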

You are creating and destroying the SHA1Managed class for EVERY file; this is horrifically inefficient. Create it once and call ComputeHash 15,000 times instead, and you'll get a huge performance increase (IMO).

public Dictionary<string, string> GetChecksums(string[] filePaths)
{
    var checksums = new Dictionary<string, string>(filePaths.Length);

    using (SHA1Managed sha1 = new SHA1Managed())
    {
        foreach (string filePath in filePaths)
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums.Add(filePath, BitConverter.ToString(sha1.ComputeHash(fs)));
            }
        }
    }
    return checksums;
}
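Usage would then be something like this (the directory path is just an example):

var checksums = GetChecksums(Directory.GetFiles(@"C:\images"));
foreach (var pair in checksums)
{
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}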

The SHA1Managed class is particularly slow to create/destroy because it calls out to p/invoke native win32 classes.

-Oisin

Profile it first.

Try dotTrace: http://www.jetbrains.com/profiler/

You didn't say whether your operation is CPU bound, or IO bound.

With a hash, I would suspect it is CPU bound. If it is CPU bound, you will see the CPU saturated (100% utilized) during the computation of the SHA hashes. If it is IO bound, the CPU will not be saturated.

If it is CPU bound, and you have a multi-CPU or multi-core machine (true for most laptops built in the last 2 years, and almost all servers built since 2002), then you can get an instant increase by using multiple threads and multiple SHA1Managed instances, computing the SHAs in parallel. On a dual-core machine you'll get 2x throughput; on a dual-core 2-CPU machine (a typical server), 4x.

By the way, when a single-threaded program like yours "saturates" the CPU on a dual-core machine, it will show up as 50% utilization in Windows Task Manager.

You need to manage the workflow through the threads, to keep track of which thread is working on which file. But this isn't hard to do.
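As a rough sketch of that idea: on .NET 4 and later, Parallel.ForEach can handle the thread management, with one hasher per worker thread via its thread-local-state overload (the method name and the ConcurrentDictionary usage here are illustrative, not from the answer):

// assumes: using System; using System.Collections.Concurrent;
//          using System.IO; using System.Security.Cryptography;
//          using System.Threading.Tasks;
public static ConcurrentDictionary<string, string> GetChecksumsParallel(string[] filePaths)
{
    var checksums = new ConcurrentDictionary<string, string>();

    Parallel.ForEach(
        filePaths,
        () => new SHA1Managed(),            // localInit: one hasher per worker thread
        (filePath, loopState, sha1) =>
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums[filePath] = BitConverter.ToString(sha1.ComputeHash(fs));
            }
            return sha1;                    // reuse the same hasher for the next file
        },
        sha1 => sha1.Dispose());            // localFinally: dispose each worker's hasher

    return checksums;
}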

Use a "ramdisk": a file system built in memory.

Have you tried using the SHA1CryptoServiceProvider class instead of SHA1Managed? SHA1CryptoServiceProvider is implemented in native code, not managed code, and was much quicker in my experience. For example:

public static byte[] CreateSHA1Hash(string filePath)
{
    byte[] hash = null;

    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 131072))
    {
        hash = sha1.ComputeHash(fs);
    }

    return hash;
}

Also, with 15,000 files I would use a file enumerator approach (i.e. WinAPI: FindFirstFile(), FindNextFile()) rather than the standard .NET Directory.GetFiles().

Directory.GetFiles loads all file paths into memory in one go. This is often much slower than enumerating files directory by directory using the WinAPI functions.
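If hand-written P/Invoke is unappealing, and assuming .NET 4 or later is available, Directory.EnumerateFiles streams results lazily from those same FindFirstFile/FindNextFile calls; a sketch (the path and pattern are just examples):

// assumes: using System.IO;
foreach (string filePath in Directory.EnumerateFiles(@"C:\images", "*", SearchOption.AllDirectories))
{
    // hash each file as it is enumerated, rather than
    // materializing all 15,000 paths up front
    byte[] hash = CreateSHA1Hash(filePath);
}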
