Is there any way I can speed up the opening and hashing of 15,000 small files in C#?

I'm working on SHA1 checksum hashing 15,000 images (40KB - 1.0MB each, approximately 1.8GB total). I'd like to speed this up as it is going to be a key operation in my program and right now it is taking between 500-600 seconds.

I've tried the following, which took 500 seconds:

public string GetChecksum(string filePath)
{
    // dispose the stream as soon as hashing completes
    using (FileStream fs = new FileStream(filePath, FileMode.Open))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(fs));
    }
}

Then I thought maybe the chunks SHA1Managed() was reading in were too small, so I used a BufferedStream and increased the buffer size to larger than any of the files I'm reading in.

public string GetChecksum(string filePath)
{
    using (var bs = new BufferedStream(File.OpenRead(filePath), 1200000))
    using (SHA1Managed sha1 = new SHA1Managed())
    {
        return BitConverter.ToString(sha1.ComputeHash(bs));
    }
}

This actually took 600 seconds.

Is there anything I can do to speed up these IO operations, or am I stuck with what I got?


As per x0n's suggestion, I tried just reading each file into a byte array and discarding the result. It appears I'm IO bound, as this took ~480 seconds in itself.
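For reference, a raw-read baseline like that could be timed with a minimal sketch along these lines (the method name and Stopwatch harness are illustrative, not from the original post):

// assumes: using System; using System.Diagnostics; using System.IO;
public static TimeSpan MeasureRawReadTime(string[] filePaths)
{
    // Read every file once and discard the bytes, so the elapsed
    // time reflects pure disk IO with no hashing cost.
    var sw = Stopwatch.StartNew();
    foreach (string filePath in filePaths)
    {
        File.ReadAllBytes(filePath);
    }
    sw.Stop();
    return sw.Elapsed;
}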

You are creating and destroying the SHA1Managed class for EVERY file; this is horrifically inefficient. Create it once and call ComputeHash 15,000 times instead, and you'll get a huge performance increase (IMO).

public Dictionary<string, string> GetChecksums(string[] filePaths)
{
    var checksums = new Dictionary<string, string>(filePaths.Length);

    using (SHA1Managed sha1 = new SHA1Managed())
    {
        foreach (string filePath in filePaths)
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums.Add(filePath, BitConverter.ToString(sha1.ComputeHash(fs)));
            }
        }
    }
    return checksums;
}
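Usage would then be something like this (the directory path is just an example):

var checksums = GetChecksums(Directory.GetFiles(@"C:\images"));
foreach (var pair in checksums)
{
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
}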

The SHA1Managed class is particularly slow to create/destroy because it calls out to p/invoke native win32 classes.

-Oisin

Profile it first.

Try dotTrace: http://www.jetbrains.com/profiler/

You didn't say whether your operation is CPU bound, or IO bound.

With a hash, I would suspect it is CPU bound. If it is CPU bound, you will see the CPU saturated (100% utilized) during the computation of the SHA hashes. If it is IO bound, the CPU will not be saturated.

If it is CPU bound, and you have a multi-CPU or multi-core machine (true for most laptops built in the last 2 years, and almost all servers built since 2002), then you can get an instant increase by using multiple threads and multiple SHA1Managed instances, computing the SHAs in parallel. On a dual-core machine you'll get 2x throughput; on a dual-core 2-CPU machine (a typical server), 4x.

By the way, when a single-threaded program like yours "saturates" the CPU on a dual-core machine, it will show up as 50% utilization in Windows Task Manager.

You need to manage the workflow through the threads, to keep track of which thread is working on which file. But this isn't hard to do.
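As a rough sketch of that idea: on .NET 4 and later, Parallel.ForEach can handle the thread management, with one hasher per worker thread via its thread-local-state overload (the method name and the ConcurrentDictionary usage here are illustrative, not from the answer):

// assumes: using System; using System.Collections.Concurrent;
//          using System.IO; using System.Security.Cryptography;
//          using System.Threading.Tasks;
public static ConcurrentDictionary<string, string> GetChecksumsParallel(string[] filePaths)
{
    var checksums = new ConcurrentDictionary<string, string>();

    Parallel.ForEach(
        filePaths,
        () => new SHA1Managed(),            // localInit: one hasher per worker thread
        (filePath, loopState, sha1) =>
        {
            using (var fs = File.OpenRead(filePath))
            {
                checksums[filePath] = BitConverter.ToString(sha1.ComputeHash(fs));
            }
            return sha1;                    // reuse the same hasher for the next file
        },
        sha1 => sha1.Dispose());            // localFinally: dispose each worker's hasher

    return checksums;
}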

Use a "ramdisk": a file system built in memory.

Have you tried using the SHA1CryptoServiceProvider class instead of SHA1Managed? SHA1CryptoServiceProvider is implemented in native code, not managed code, and was much quicker in my experience. For example:

public static byte[] CreateSHA1Hash(string filePath)
{
    byte[] hash = null;

    using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
    using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 131072))
    {
        hash = sha1.ComputeHash(fs);
    }

    return hash;
}

Also, with 15,000 files I would use a file enumerator approach (i.e. WinAPI: FindFirstFile(), FindNextFile()) rather than the standard .NET Directory.GetFiles().

Directory.GetFiles loads all file paths into memory in one go. This is often much slower than enumerating files directory by directory using the WinAPI functions.
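If hand-written P/Invoke is unappealing, and assuming .NET 4 or later is available, Directory.EnumerateFiles streams results lazily from those same FindFirstFile/FindNextFile calls; a sketch (the path and pattern are just examples):

// assumes: using System.IO;
foreach (string filePath in Directory.EnumerateFiles(@"C:\images", "*", SearchOption.AllDirectories))
{
    // hash each file as it is enumerated, rather than
    // materializing all 15,000 paths up front
    byte[] hash = CreateSHA1Hash(filePath);
}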
