
What is the fastest way to find if an array of byte arrays contains another byte array?

I have some code that is really slow. I knew it would be, and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine whether I have already read a file, I hash its bytes and compare that hash to a list of hashes of already-processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:

using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class ProgramExtensions
{
    public static byte[] ToSHA256Hash(this FileInfo file)
    {
        using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
        {
            using (SHA256 hasher = new SHA256Managed())
            {
                return hasher.ComputeHash(fs);
            }
        }
    }
    public static string ToHexString(this byte[] p)
    {
        // Builds a "0x..."-prefixed hex string: two hex characters per byte plus the prefix.
        char[] c = new char[p.Length * 2 + 2];

        byte b;

        c[0] = '0'; c[1] = 'x';

        for (int y = 0, x = 2; y < p.Length; ++y, ++x)
        {
            b = ((byte)(p[y] >> 4));

            c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);

            b = ((byte)(p[y] & 0xF));

            c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
        }

        return new string(c);

    }
}

class Program
{
    static void Main(string[] args)
    {
        var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");

        List<string> readFileHashes = GetReadFileHashes();

        List<FileInfo> filesToRead = new List<FileInfo>();

        foreach (var file in allFiles)
        {
            // A file is new only if its hash is NOT already in the list.
            if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                filesToRead.Add(file);
        }

        //read new files
    }
}

Is there any way I can speed this up?

I believe you can achieve the most significant performance improvement by simply checking the file size first: if the size does not match anything you have already read, you can skip the file entirely and never even open it.

Instead of saving only a list of known hashes, also keep a list of known file sizes and do a content comparison only when the sizes match. When the size doesn't match, you save yourself from even looking at the file content.
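A minimal sketch of that idea, reusing the ToSHA256Hash/ToHexString extensions from the question. The knownBySize dictionary and the IsAlreadyRead helper are hypothetical names; the store maps each known file size to the set of hex hashes already seen for that size:

static Dictionary<long, HashSet<string>> knownBySize = new Dictionary<long, HashSet<string>>();

static bool IsAlreadyRead(FileInfo file)
{
    HashSet<string> hashesForSize;

    // Cheap check first: a size we have never seen means a new file, no I/O needed.
    if (!knownBySize.TryGetValue(file.Length, out hashesForSize))
        return false;

    // Only files whose size matches something already read get opened and hashed.
    return hashesForSize.Contains(file.ToSHA256Hash().ToHexString());
}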

Depending on how large your files generally are, a further improvement can be worthwhile:

  • Either do a binary compare with an early abort as soon as the first differing byte is found (see the sketch after this list). This saves reading the entire file, which can be a very significant improvement if your files are generally large; any hash algorithm has to read the whole file, whereas detecting that the first byte differs spares you the rest. If your lookup list is likely to contain many files of the same size, so that you would have to do a binary comparison against several candidates, consider instead:

  • Hashing in blocks of, say, 1 MB each. First check only the first block against the precalculated first-block hash in your lookup; compare the second block only if the first block matches. In most cases this avoids reading beyond the first block of a file that differs. Both options are only really worth the effort when your files are large.
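A rough sketch of the early-abort binary compare from the first bullet, assuming the two files are already known to have the same length (the buffer size is arbitrary):

static bool FilesHaveSameContent(string pathA, string pathB)
{
    const int BufferSize = 64 * 1024;
    byte[] bufA = new byte[BufferSize];
    byte[] bufB = new byte[BufferSize];

    using (FileStream a = File.OpenRead(pathA))
    using (FileStream b = File.OpenRead(pathB))
    {
        int readA;
        while ((readA = a.Read(bufA, 0, BufferSize)) > 0)
        {
            int readB = b.Read(bufB, 0, BufferSize);
            if (readA != readB)
                return false;              // lengths differ after all

            for (int i = 0; i < readA; i++)
                if (bufA[i] != bufB[i])
                    return false;          // first mismatch aborts without reading the rest
        }
    }

    return true;
}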

I doubt that changing the hashing algorithm itself (e.g. doing a first check with a CRC, as suggested) would make any significant difference. Your bottleneck is likely disk I/O, not CPU, so avoiding disk I/O is what will give you the most improvement. But as always with performance, do measure.

Then, if this is still not enough (and only then), experiment with asynchronous I/O (remember, though, that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance).

  • Create a file list
  • Sort the list by file size
  • Eliminate files with unique sizes from the list
  • Now do the hashing (a fast hash first might improve performance as well)
  • Use a data structure for your readFileHashes store that has an efficient search capability (hashing or binary search). I think a HashSet or TreeSet would serve you better here. (A rough sketch of the whole pipeline follows below.)
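A rough sketch of that pipeline, assuming the known-file store records both the size and the hex hash of every file already processed, and assuming a using System.Linq directive for OrderBy. GetKnownFileSizes and the HashSet-returning GetReadFileHashes are hypothetical:

HashSet<long> knownSizes = GetKnownFileSizes();
HashSet<string> readFileHashes = GetReadFileHashes();

var filesToRead = new List<FileInfo>();

// Files whose size matches nothing already read are new by definition;
// only the remainder are hashed and looked up in the HashSet.
foreach (var file in allFiles.OrderBy(f => f.Length))
{
    if (!knownSizes.Contains(file.Length) ||
        !readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
    {
        filesToRead.Add(file);
    }
}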

  • Use an appropriate checksum (hash) function. SHA256 is a cryptographic hash and is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but is susceptible to changes that are designed/intended to be hidden. Which fits the differences between the files you are scanning?

    See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes

    Would a really simple checksum via sampling (e.g. checksum = first 10 bytes plus last 10 bytes) work?
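    A rough sketch of that sampling idea: a cheap fingerprint built from the file length plus its first and last 10 bytes. It is only a pre-filter; matching fingerprints would still need a full hash or byte-by-byte compare to confirm. SampleFingerprint is a hypothetical name:

    static string SampleFingerprint(FileInfo file)
    {
        const int SampleSize = 10;
        byte[] head = new byte[SampleSize];
        byte[] tail = new byte[SampleSize];

        using (FileStream fs = File.OpenRead(file.FullName))
        {
            fs.Read(head, 0, SampleSize);               // may read fewer bytes on tiny files
            if (fs.Length > SampleSize)
            {
                fs.Seek(-SampleSize, SeekOrigin.End);   // jump to the last 10 bytes
                fs.Read(tail, 0, SampleSize);
            }
        }

        return file.Length + ":" + BitConverter.ToString(head) + ":" + BitConverter.ToString(tail);
    }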

I'd do a quick CRC check first, as it is less expensive. If the CRCs match, follow up with a more "reliable" hash test such as SHA to confirm.

Your description of the problem still isn't clear enough.

The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.

You might want to look at the modification time instead, which does not change when a file is renamed:

http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx

Or you might want to monitor the folder for any new file changes:

http://www.codeguru.com/forum/showthread.php?t=436716
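One way to monitor a folder in .NET is the FileSystemWatcher class; a minimal sketch, with the path and the handler body as placeholders:

var watcher = new FileSystemWatcher("c:\\temp");

// Raised once for each file created in the folder; queue the path for processing here.
watcher.Created += (sender, e) => Console.WriteLine("New file: " + e.FullPath);
watcher.EnableRaisingEvents = true;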

First group the files by file size - this will leave you with smaller groups of files. What to do next depends on the group sizes and the file sizes. You could simply start reading all files in a group in parallel until you find a difference; if there is a difference, split the group into smaller groups that share the same value at the current position. If you have information about how the files typically differ, you can use it - start reading at the end, don't read and compare byte by byte if larger clusters change, or whatever else you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disk access.

You could also calculate hash values for all files in each group and compare those. You don't necessarily have to process the whole file at once - just hash a few bytes (maybe a 4 KiB cluster, or whatever fits your file sizes) and check whether there are already differences. If not, hash the next few bytes. This lets you process each file in larger blocks without having to keep one such large block in memory for every file in a group.
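A sketch of that block-wise idea: hash only the first block of a file so that most non-matching candidates are rejected without reading them in full. The 4 KiB block size just follows the example above, and HashFirstBlock is a hypothetical helper:

static byte[] HashFirstBlock(FileInfo file, int blockSize = 4096)
{
    byte[] buffer = new byte[blockSize];
    int read;

    // Read at most one block; small files simply yield a shorter hash input.
    using (FileStream fs = File.OpenRead(file.FullName))
        read = fs.Read(buffer, 0, blockSize);

    using (SHA256 hasher = SHA256.Create())
        return hasher.ComputeHash(buffer, 0, read);
}

Only when the first-block hashes match would you go on to hash and compare the second block, and so on.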

So it's all about a time-memory (disk I/O versus memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data), and reading the files byte by byte and comparing only the bytes read so far (low memory requirement, slow random access, reads only the required data). Further, if the groups are very large, comparing the files byte by byte becomes slower - comparing one byte across n files is an O(n) operation - and it may become more efficient to calculate hash values first and then compare only those.

Updated: definitely DO NOT make file size your only check. If your OS version allows it, also use FileInfo.LastWriteTime.

I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last-modified dates and hash data in a SQL database. On subsequent runs we first query the modified date of any given file, and only then the hash data... that way we only calculate new hash data for files that appear to have been modified...

.NET has a way to check the last-modified date, in the FileInfo class... I suggest you check it out. EDIT: here is the link: LastWriteTime

Our packager takes about 20 secs to find out which files have been modified.
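A rough sketch of that cache-by-modified-date approach, with a Dictionary standing in for the SQL table (the cache variable and GetHash helper are hypothetical names):

static Dictionary<string, Tuple<DateTime, string>> cache =
    new Dictionary<string, Tuple<DateTime, string>>();

static string GetHash(FileInfo file)
{
    Tuple<DateTime, string> entry;

    // Re-hash only when the file is new or its last write time has changed since the previous run.
    if (cache.TryGetValue(file.FullName, out entry) && entry.Item1 == file.LastWriteTimeUtc)
        return entry.Item2;

    string hash = file.ToSHA256Hash().ToHexString();
    cache[file.FullName] = Tuple.Create(file.LastWriteTimeUtc, hash);
    return hash;
}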
