Which is a better way to compare 2 files?

I have the following situation in C#:

ZipFile z1 = ZipFile.Read("f1.zip");
ZipFile z2 = ZipFile.Read("f2.zip");

MemoryStream ms1 = new MemoryStream();
MemoryStream ms2 = new MemoryStream();

ZipEntry zipentry1 = z1["f1.dll"];
ZipEntry zipentry2 = z2["f2.dll"];

zipentry1.Extract(ms1);
zipentry2.Extract(ms2);

byte[] b1 = new byte[ms1.Length];
byte[] b2 = new byte[ms2.Length];

ms1.Seek(0, SeekOrigin.Begin);
ms2.Seek(0, SeekOrigin.Begin);

What I have done here is open 2 zip files, f1.zip and f2.zip. Then I extract one file from each (f1.dll from f1.zip and f2.dll from f2.zip) into the MemoryStream objects. I now want to compare the files and find out whether they are the same. I had 2 ways in mind:

1) Read the memory streams byte by byte and compare them. For this I would use

ms1.Read(b1, 0, (int) ms1.Length);
ms2.Read(b2, 0, (int) ms2.Length);

and then run a for loop and compare each byte in b1 and b2.

2) Get the string values for both memory streams and then do a string compare. For this I would use

string str1 = Encoding.UTF8.GetString(ms1.GetBuffer(), 0, (int)ms1.Length);
string str2 = Encoding.UTF8.GetString(ms2.GetBuffer(), 0, (int)ms2.Length);

and then do a simple string compare.

Now, I know comparing byte by byte will always give me a correct result. But the problem is that it will take a lot of time, as I have to do this for thousands of files. That is why I am thinking about the string compare method, which looks like it could find out very quickly whether the files are equal. But I am not sure string compare will give me the correct result, as the files are DLLs, media files, etc. and will certainly contain special characters.

Can anyone tell me whether the string compare method will work correctly?

Thanks in advance.

PS: I am using DotNetLibrary.

The baseline for this question is the native way to compare arrays: Enumerable.SequenceEqual. You should use that unless you have good reason to do otherwise.
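
As a minimal sketch of that baseline, the helper below wraps Enumerable.SequenceEqual with an explicit length check. The class and method names (ByteArrayComparer, AreEqual) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Linq;

static class ByteArrayComparer
{
    // Hypothetical helper: true when both arrays hold exactly the same bytes.
    public static bool AreEqual(byte[] a, byte[] b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // Arrays of different lengths can never be equal, so skip the element walk.
        if (a.Length != b.Length) return false;

        // Enumerable.SequenceEqual compares the elements pairwise, in order.
        return a.SequenceEqual(b);
    }
}
```

With the buffers from the question this would be called as ByteArrayComparer.AreEqual(b1, b2) once both arrays have been filled.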

If you care about speed, you could attempt to p/invoke memcmp in msvcrt.dll and compare the byte arrays that way. I find it hard to imagine that could be beaten. Obviously you'd compare the lengths first and only call memcmp if the two byte arrays had the same length.

The p/invoke looks like this:

[DllImport("msvcrt.dll", CallingConvention=CallingConvention.Cdecl)]
static extern int memcmp(byte[] lhs, byte[] rhs, UIntPtr count);
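
A wrapper around that declaration might look like the sketch below, which does the length check before calling into native code. The class name NativeByteComparer and the method AreEqual are made up for illustration, and the sketch assumes a Windows machine where msvcrt.dll can be loaded.

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeByteComparer
{
    // Same declaration as above. msvcrt.dll only exists on Windows (assumption).
    [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern int memcmp(byte[] lhs, byte[] rhs, UIntPtr count);

    // Hypothetical helper: compares lengths first, then lets memcmp do the work.
    public static bool AreEqual(byte[] a, byte[] b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        // Only call into native code when the lengths match.
        if (a.Length != b.Length) return false;

        // memcmp returns 0 when the first 'count' bytes are identical.
        return memcmp(a, b, new UIntPtr((uint)a.Length)) == 0;
    }
}
```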

But you should only contemplate this if you really do care about speed and the pure managed alternatives are too slow for you. So, do some timings to make sure you are not optimising prematurely. Well, even to make sure that you are optimising at all.
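
One rough way to get such timings is a small Stopwatch harness like the sketch below. It is not a rigorous benchmark; the names CompareTimings, LoopEqual, Time and Report are hypothetical, and the buffer size and iteration count are left to the caller.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

static class CompareTimings
{
    // Plain managed loop, as described in the question.
    public static bool LoopEqual(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    // Runs 'compare' the given number of times and returns the elapsed time.
    public static TimeSpan Time(Func<byte[], byte[], bool> compare,
                                byte[] a, byte[] b, int iterations)
    {
        compare(a, b); // warm-up call so JIT compilation is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            compare(a, b);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    // Prints timings for two managed candidates on identical buffers,
    // which is the slow case because every byte has to be visited.
    public static void Report(int sizeInBytes, int iterations)
    {
        byte[] a = new byte[sizeInBytes];
        new Random(42).NextBytes(a);
        byte[] b = (byte[])a.Clone();

        Console.WriteLine("for loop:      " + Time(LoopEqual, a, b, iterations));
        Console.WriteLine("SequenceEqual: " + Time((x, y) => x.SequenceEqual(y), a, b, iterations));
    }
}
```

A call such as CompareTimings.Report(1 << 20, 100) would time both approaches on 1 MiB buffers; a memcmp-based comparer could be passed to Time in the same way.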

I'd be very surprised if converting to string was fast. I'd expect it to be slow. And in fact I'd expect your code to fail because there's no reason for your byte arrays to be valid UTF-8. Just forget you ever had that idea!
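
To make the UTF-8 problem concrete, here is a small sketch (the class and method names are hypothetical). Encoding.UTF8 replaces every invalid byte sequence with U+FFFD by default instead of throwing, so different binary inputs can decode to the same string.

```csharp
using System;
using System.Linq;
using System.Text;

static class Utf8ComparePitfall
{
    // Mimics the string-based comparison proposed in the question.
    public static bool StringEqual(byte[] a, byte[] b)
    {
        return Encoding.UTF8.GetString(a) == Encoding.UTF8.GetString(b);
    }

    // Byte-wise comparison for reference.
    public static bool BytesEqual(byte[] a, byte[] b)
    {
        return a.SequenceEqual(b);
    }
}
```

For the single-byte inputs { 0xFF } and { 0xFE }, neither of which is valid UTF-8, StringEqual returns true although BytesEqual returns false, because both decode to the replacement character.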

Compare ZipEntry.Crc and ZipEntry.UncompressedSize of the two files first; only if they match, uncompress and do the byte comparison. If the two files are the same, their CRC and size will be the same too. This strategy will save you a ton of CPU cycles.

ZipEntry zipentry1 = z1["f1.dll"];
ZipEntry zipentry2 = z2["f2.dll"];

if (zipentry1.Crc == zipentry2.Crc && zipentry1.UncompressedSize == zipentry2.UncompressedSize)
{
    // uncompress
    zipentry1.Extract(ms1);
    zipentry2.Extract(ms2);

    byte[] b1 = new byte[ms1.Length];
    byte[] b2 = new byte[ms2.Length];

    ms1.Seek(0, SeekOrigin.Begin);
    ms2.Seek(0, SeekOrigin.Begin);

    ms1.Read(b1, 0, (int) ms1.Length);
    ms2.Read(b2, 0, (int) ms2.Length);

    // perform a byte comparison
    if (Enumerable.SequenceEqual(b1, b2)) // or a simple for loop
    {
        // files are the same
    }
    else
    {
        // files are different
    }
}
else
{
    // files are different
}
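
The same check-metadata-first idea can be sketched without a third-party library. The version below swaps in System.IO.Compression.ZipArchive from the .NET base class library instead of the ZipFile/ZipEntry types used above, and assumes a runtime where ZipArchiveEntry.Crc32 is available. ZipEntryComparer, ReadAll, AreEqual and MakeArchive are hypothetical names, and MakeArchive exists only so the sketch can be tried without files on disk.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

static class ZipEntryComparer
{
    // Reads the whole uncompressed entry into memory.
    static byte[] ReadAll(ZipArchiveEntry entry)
    {
        using (Stream s = entry.Open())
        using (MemoryStream ms = new MemoryStream())
        {
            s.CopyTo(ms);
            return ms.ToArray();
        }
    }

    public static bool AreEqual(ZipArchiveEntry e1, ZipArchiveEntry e2)
    {
        // Cheap metadata check first: nothing is decompressed here.
        if (e1.Length != e2.Length || e1.Crc32 != e2.Crc32) return false;

        // Equal CRCs do not prove equality, so confirm byte by byte.
        return ReadAll(e1).SequenceEqual(ReadAll(e2));
    }

    // Hypothetical helper: builds a one-entry archive in memory for experiments.
    public static ZipArchive MakeArchive(string entryName, byte[] content)
    {
        MemoryStream ms = new MemoryStream();
        using (ZipArchive writer = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        using (Stream s = writer.CreateEntry(entryName).Open())
        {
            s.Write(content, 0, content.Length);
        }
        ms.Position = 0; // rewind so the archive can be reopened for reading
        return new ZipArchive(ms, ZipArchiveMode.Read);
    }
}
```

The byte comparison only runs for entries whose size and CRC already match, so for thousands of mostly different files most pairs are rejected without decompressing anything.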
