
Compare the contents of large files

I need to compare the contents of very large files. Speed of the program is important. I need a 100% match. I read a lot of information but did not find the optimal solution. I am considering two choices, and both have problems.

  1. Compare the whole file byte by byte - not fast enough for large files.
  2. Compare the files using hashes - two files with the same hash are not guaranteed to be a 100% match.

What would you suggest? Maybe I could make use of threads? Could MemoryMappedFile be helpful?

If you really need to guarantee 100% that the files are 100% identical, then you need to do a byte-to-byte comparison. That's just entailed in the problem - the only hashing method with a 0% risk of false matching is the identity function!

What we're left with is short-cuts that can give us quick answers and let us skip the byte-for-byte comparison some of the time.

As a rule, the only short-cut to proving equality is proving identity. In OO code that would be showing that two references are in fact the same object. The closest thing in files is if a binding or NTFS junction means two paths point to the same file. This happens so rarely that unless the nature of the work makes it more usual than normal, checking for it won't be a net gain.

So we're left with short-cuts for finding mismatches. They do nothing to speed up our passes, but they make our fails faster:

  1. Different size means not byte-for-byte equal. Simples!
  2. If you will examine the same file more than once, then hash it and record the hash. A different hash guarantees the files are not equal. The reduction in files that need a one-to-one comparison can be massive.
  3. Many file formats are likely to have some areas in common. Particularly, the first bytes in many formats tend to be "magic numbers", headers, etc. Either skip them, or skip them and check them last (if there is a chance of them being different, but it's low).

Then there's the matter of making the actual comparison as fast as possible. Loading batches of 4 octets at a time into an integer and doing integer comparison will often be faster than comparing octet-by-octet.
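
As a minimal sketch of that idea in C# (the CompareChunks name and the tail fallback are my own, and on modern .NET a vectorized ReadOnlySpan<byte>.SequenceEqual would likely beat both):

    // Sketch: compare `count` bytes of two buffers 8 octets at a time by
    // reinterpreting them as 64-bit integers, with a byte-by-byte
    // fallback for the tail that doesn't fill a full long.
    static bool CompareChunks(byte[] a, byte[] b, int count)
    {
        int i = 0;

        // Main loop: one 64-bit comparison covers 8 bytes.
        for (; i <= count - sizeof(long); i += sizeof(long))
        {
            if (BitConverter.ToInt64(a, i) != BitConverter.ToInt64(b, i))
            {
                return false;
            }
        }

        // Tail: compare any remaining bytes individually.
        for (; i < count; i++)
        {
            if (a[i] != b[i])
            {
                return false;
            }
        }

        return true;
    }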

Threading can help. One way is to split the actual comparison of a file pair into more than one operation, but if possible a bigger gain will be found by running completely different comparisons in different threads. I'd need to know a bit more about just what you are doing to advise much further, but the main thing is to make sure the output of the tests is thread-safe.

If you do have more than one thread examining the same files, have them work far from each other. E.g. if you have four threads, you could split the file in four, or you could have one take bytes 0, 4, 8 while another takes bytes 1, 5, 9, etc. (or 4-octet group 0, 4, 8, etc.). The latter is much more likely to have false-sharing issues than the former, so don't do that.

Edit:

It also depends on just what you're doing with the files. You say you need 100% certainty, so this bit doesn't apply to you, but it's worth adding for the more general problem: if the cost of a false positive is a waste of resources, time or memory rather than an actual failure, then reducing false positives through a fuzzy short-cut could be a net win, and it can be worth profiling to see if this is the case.

If you are using a hash to speed things up (it can at least find some definite mismatches faster), then Bob Jenkins' Spooky Hash is a good choice; it's not cryptographically secure, but if that's not your purpose it creates a 128-bit hash very quickly (much faster than a cryptographic hash, or even than the approaches taken with many GetHashCode() implementations) that is extremely good at avoiding accidental collisions (the sort of deliberate collisions cryptographic hashes avoid is another matter). I implemented it for .Net and put it on NuGet because nobody else had when I found myself wanting to use it.

Serial Compare

Test File Size(s): 118 MB
Duration: 579 ms
Equal? true

    static bool Compare(string filePath1, string filePath2)
    {
        using (FileStream file = File.OpenRead(filePath1))
        {
            using (FileStream file2 = File.OpenRead(filePath2))
            {
                if (file.Length != file2.Length)
                {
                    return false;
                }

                int count;
                const int size = 0x1000000; // 16 MB buffers

                var buffer = new byte[size];
                var buffer2 = new byte[size];

                while ((count = file.Read(buffer, 0, buffer.Length)) > 0)
                {
                    // Read may return fewer bytes than requested, so keep
                    // reading until buffer2 holds the same count of bytes.
                    int read = 0;
                    while (read < count)
                    {
                        int n = file2.Read(buffer2, read, count - read);
                        if (n == 0)
                        {
                            return false; // second file ended unexpectedly
                        }
                        read += n;
                    }

                    for (int i = 0; i < count; i++)
                    {
                        if (buffer[i] != buffer2[i])
                        {
                            return false;
                        }
                    }
                }
            }
        }

        return true;
    }


Parallel Compare

Test File Size(s): 118 MB
Duration: 340 ms
Equal? true

    static bool Compare2(string filePath1, string filePath2)
    {
        bool success = true;

        var info = new FileInfo(filePath1);
        var info2 = new FileInfo(filePath2);

        if (info.Length != info2.Length)
        {
            return false;
        }

        long fileLength = info.Length;
        const int size = 0x1000000;

        // Round the chunk count up so a final partial chunk is compared too.
        long chunkCount = (fileLength + size - 1) / size;

        Parallel.For(0, chunkCount, (x, state) =>
        {
            // Use long arithmetic; an int offset would overflow past 2 GB.
            long start = x * size;
            int toRead = (int)Math.Min(size, fileLength - start);

            // Each iteration opens its own streams, since a FileStream's
            // position cannot safely be shared between threads.
            using (FileStream file = File.OpenRead(filePath1))
            using (FileStream file2 = File.OpenRead(filePath2))
            {
                var buffer = new byte[size];
                var buffer2 = new byte[size];

                file.Position = start;
                file2.Position = start;

                // Read may return fewer bytes than requested, so loop
                // until each buffer holds the whole chunk.
                int read = 0;
                while (read < toRead)
                {
                    read += file.Read(buffer, read, toRead - read);
                }

                read = 0;
                while (read < toRead)
                {
                    read += file2.Read(buffer2, read, toRead - read);
                }

                for (int i = 0; i < toRead; i++)
                {
                    if (buffer[i] != buffer2[i])
                    {
                        success = false;
                        state.Stop(); // no need to schedule further chunks
                        return;
                    }
                }
            }
        });

        return success;
    }


MD5 Compare

Test File Size(s): 118 MB
Duration: 702 ms
Equal? true

    static bool Compare3(string filePath1, string filePath2)
    {
        byte[] hash1 = GenerateHash(filePath1);
        byte[] hash2 = GenerateHash(filePath2);

        if (hash1.Length != hash2.Length)
        {
            return false;
        }

        for (int i = 0; i < hash1.Length; i++)
        {
            if (hash1[i] != hash2[i])
            {
                return false;
            }
        }

        return true;
    }

    static byte[] GenerateHash(string filePath)
    {
        // MD5 instances are IDisposable, so release them when done.
        using (MD5 crypto = MD5.Create())
        using (FileStream stream = File.OpenRead(filePath))
        {
            return crypto.ComputeHash(stream);
        }
    }

tl;dr Compare byte segments in parallel to determine if two files are equal.

Why not both?

Compare with hashes for the first pass, then return to conflicts and perform the byte-by-byte comparison. This allows maximal speed with guaranteed 100% match confidence.
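
A rough sketch of that two-pass idea, reusing the Compare and GenerateHash methods shown above plus System.Linq's SequenceEqual (the CompareBoth name is mine):

    // Sketch: cheap checks first, byte-for-byte only to confirm a match.
    // Reuses Compare() and GenerateHash() from the examples above.
    static bool CompareBoth(string filePath1, string filePath2)
    {
        // Cheapest test: different lengths can never be equal.
        if (new FileInfo(filePath1).Length != new FileInfo(filePath2).Length)
        {
            return false;
        }

        // Different hashes guarantee the files differ.
        if (!GenerateHash(filePath1).SequenceEqual(GenerateHash(filePath2)))
        {
            return false;
        }

        // Same hash is not proof of equality, so confirm byte-for-byte.
        return Compare(filePath1, filePath2);
    }

Note that for a one-off comparison of two files the hash pass reads both files fully anyway, so this mainly pays off when hashes are cached and reused across many comparisons.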

There's no avoiding byte-for-byte comparisons if you want perfect comparisons (the file still has to be read byte-for-byte to do any hashing), so the issue is how you're reading and comparing the data.

So there are two things you'll want to address:

  • Concurrency - make sure you're reading data at the same time as you're checking it.
  • Buffer size - reading the file 1 byte at a time is going to be slow; make sure you're reading it into a decent-sized buffer (about 8 MB should do nicely on very large files).

The objective is to make sure you can do your comparison as fast as the hard disk can read the data, and that you're always reading data with no delays. If you're doing everything as fast as the data can be read from the drive, then that's as fast as it's possible to do it, since the hard disk read speed becomes the bottleneck.
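
As a sketch of that overlap using double buffering (assuming .NET 4.5+ for ReadAsync and System.Threading.Tasks; the method name is mine, and for brevity it assumes each ReadAsync fills the same number of bytes from both equal-length files, where production code should loop until each buffer is full):

    // Sketch: overlap disk reads with comparison. While one pair of
    // buffers is being compared, the next pair is already being filled
    // by asynchronous reads.
    static async Task<bool> CompareOverlappedAsync(string filePath1, string filePath2)
    {
        const int size = 8 * 1024 * 1024; // ~8 MB per buffer

        using (var file = new FileStream(filePath1, FileMode.Open, FileAccess.Read,
            FileShare.Read, 4096, useAsync: true))
        using (var file2 = new FileStream(filePath2, FileMode.Open, FileAccess.Read,
            FileShare.Read, 4096, useAsync: true))
        {
            if (file.Length != file2.Length)
            {
                return false;
            }

            var buffers = new[] { new byte[size], new byte[size] };
            var buffers2 = new[] { new byte[size], new byte[size] };
            int current = 0;

            // Start the first pair of reads.
            Task<int> read = file.ReadAsync(buffers[current], 0, size);
            Task<int> read2 = file2.ReadAsync(buffers2[current], 0, size);

            while (true)
            {
                int count = await read;
                await read2;

                if (count == 0)
                {
                    return true; // both streams exhausted (lengths match)
                }

                // Kick off the next reads before comparing this chunk.
                int next = 1 - current;
                read = file.ReadAsync(buffers[next], 0, size);
                read2 = file2.ReadAsync(buffers2[next], 0, size);

                // Compare the chunk already in hand while the disk works.
                for (int i = 0; i < count; i++)
                {
                    if (buffers[current][i] != buffers2[current][i])
                    {
                        // Let in-flight reads finish before disposing.
                        await Task.WhenAll(read, read2);
                        return false;
                    }
                }

                current = next;
            }
        }
    }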

Ultimately a hash is going to read the file byte by byte anyway... so if you are looking for an accurate comparison then you might as well do the comparison directly. Can you give some more background on what you are trying to accomplish? How big are the 'big' files? How often do you have to compare them?

If you have a large set of files and you are trying to identify duplicates, I would try to break down the work by order of expense. I might try something like the following (a rough sketch of the first two steps follows the list):

1) Group files by size. Files with different sizes clearly can't be identical. This information is very inexpensive to retrieve. If each group only contains 1 file, you are done, no dupes; otherwise proceed to step 2.

2) Within each size group, generate a hash of the first n bytes of the file. Identify a reasonable n that will likely detect differences. Many files have identical headers, so you want to make sure n is greater than that header length. Group by the hashes; if each group contains 1 file, you are done (no dupes within this group), otherwise proceed to step 3.

3) At this point you are likely going to have to do more expensive work, like generating a hash of the whole file or doing a byte-by-byte comparison. Depending on the number of files and the nature of the file contents, you might try different approaches. Hopefully, the previous groupings will have narrowed down the likely duplicates, so the number of files that you actually have to fully scan will be very small.
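
Here is that sketch of steps 1 and 2 (it needs System.Linq and System.Security.Cryptography; the method names, the 64 KB default prefix length, and the Base64 grouping key are illustrative choices, not anything canonical):

    // Sketch: narrow duplicate candidates down cheaply before any full scan.
    // Step 1 groups by length; step 2 groups by a hash of the first
    // prefixLength bytes. Only groups surviving both filters need step 3.
    static List<List<string>> FindDuplicateCandidates(
        IEnumerable<string> filePaths, int prefixLength = 64 * 1024)
    {
        var candidates = new List<List<string>>();

        // Step 1: group by file length; a group of one has no dupes.
        foreach (var sizeGroup in filePaths.GroupBy(p => new FileInfo(p).Length))
        {
            if (sizeGroup.Count() < 2)
            {
                continue;
            }

            // Step 2: within a size group, group by the prefix hash.
            // Choose prefixLength longer than any common format header.
            foreach (var prefixGroup in sizeGroup.GroupBy(p => HashPrefix(p, prefixLength)))
            {
                if (prefixGroup.Count() >= 2)
                {
                    // Step 3 (not shown): full hash or byte-by-byte
                    // comparison within each remaining group.
                    candidates.Add(prefixGroup.ToList());
                }
            }
        }

        return candidates;
    }

    static string HashPrefix(string filePath, int prefixLength)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(filePath))
        {
            var buffer = new byte[(int)Math.Min(prefixLength, stream.Length)];

            // Loop: Read may return fewer bytes than requested.
            int read = 0;
            while (read < buffer.Length)
            {
                int n = stream.Read(buffer, read, buffer.Length - read);
                if (n == 0)
                {
                    break;
                }
                read += n;
            }

            return Convert.ToBase64String(md5.ComputeHash(buffer, 0, read));
        }
    }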

To calculate a hash, the entire file needs to be read.

How about opening both files together, and comparing them chunk by chunk?

Pseudo code:

open file A
open file B
while file A has more data
{
    if next chunk of A != next chunk of B return false
}
return true

This way you are not loading too much at once, and you avoid reading the entire file if you find a mismatch early. You should set up a test that varies the chunk size to determine the right size for optimal performance.
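
A crude timing harness might look like this (Stopwatch lives in System.Diagnostics; CompareWithChunkSize is a hypothetical variant of the earlier Compare method that takes the buffer size as a parameter). Bear in mind the OS file cache makes later runs faster, so repeat each size or test against files larger than RAM:

    // Sketch: time the comparison at several chunk sizes to find the
    // sweet spot for your hardware. CompareWithChunkSize is assumed to
    // be the Compare method above with the buffer size as a parameter.
    static void BenchmarkChunkSizes(string filePath1, string filePath2)
    {
        // Try powers of two from 4 KB up to 64 MB.
        for (int size = 4 * 1024; size <= 64 * 1024 * 1024; size *= 2)
        {
            var stopwatch = Stopwatch.StartNew();
            bool equal = CompareWithChunkSize(filePath1, filePath2, size);
            stopwatch.Stop();

            Console.WriteLine("{0,12:N0} bytes: {1,6} ms (equal: {2})",
                size, stopwatch.ElapsedMilliseconds, equal);
        }
    }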
