Intersect and union in byte arrays of 2 files

Question

I have two files. The first is the source file and the second is the destination file.

Below is my code to intersect and union the two files using byte arrays.

FileStream frsrc = new FileStream("Src.bin", FileMode.Open);
FileStream frdes = new FileStream("Des.bin", FileMode.Open);
int length = 24; // get file length
byte[] src = new byte[length];
byte[] des = new byte[length]; // create buffer
int Counter = 0;   // actual number of bytes read
int subcount = 0;

while (frsrc.Read(src, 0, length) > 0)
{
    try
    {
        Counter = 0;
        frdes.Position = subcount * length;
        while (frdes.Read(des, 0, length) > 0)
        {
            var data = src.Intersect(des);
            var data1 = src.Union(des);
            Counter++;
        }
        subcount++;
        Console.WriteLine(subcount.ToString());
    }
    catch (Exception ex)
    {                          
    }
}

It works fine and runs very fast. But now the problem is that I want the counts, and when I use the code below it becomes very slow.

  var data = src.Intersect(des).Count();
  var data1 = src.Union(des).Count();

So, is there any solution for that? If so, please let me know as soon as possible. Thanks.

Answer 1

Intersect and Union are not the fastest operations. The reason you see the code being fast is that you never actually enumerate the results!

Both return an enumerable, not the actual results of the operation. You're supposed to enumerate that enumerable, otherwise nothing happens; this is called "deferred execution". Now, when you call Count, you actually enumerate the enumerable and incur the full cost of the Intersect and Union. Believe me, the Count itself is relatively trivial (though still an O(n) operation!).
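To make the deferred execution visible, here is a minimal sketch. The class DeferredDemo, the helper Tracked and the counter Pulled are hypothetical names introduced only for this illustration; the sample arrays are made up. It counts how many source elements have been pulled before and after calling Count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    // Number of elements handed out by Tracked so far (illustration only).
    public static int Pulled;

    // Wraps a byte array so that every element pulled from it is counted.
    static IEnumerable<byte> Tracked(byte[] source)
    {
        foreach (var b in source)
        {
            Pulled++;
            yield return b;
        }
    }

    public static int Run()
    {
        byte[] src = { 1, 2, 3, 4 };
        byte[] des = { 3, 4, 5, 6 };

        Pulled = 0;
        var query = Tracked(src).Intersect(des); // only builds the query
        int pulledBeforeCount = Pulled;          // still 0: nothing has run yet

        int n = query.Count();                   // enumerates: the real work happens here

        Console.WriteLine($"before Count: {pulledBeforeCount}, after Count: {Pulled}, result: {n}");
        return n;
    }
}
```

With these sample arrays, Run should print "before Count: 0, after Count: 4, result: 2": building the query touches no data, and the whole source is only walked once Count is called.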

You'll most likely need to write your own methods. You want to avoid the enumerable overhead and, more importantly, you'll probably want a lookup table.
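One possible shape for such a method, as a sketch rather than a tuned implementation: since a byte has only 256 possible values, two bool[256] tables can serve as the lookup. ByteSetCounts and its Count method are hypothetical names; the aLen and bLen parameters are an assumption so that only the bytes actually returned by Read are considered. Like LINQ's Intersect and Union, it counts distinct values.

```csharp
using System;

static class ByteSetCounts
{
    // Returns how many distinct byte values occur in both buffers (Intersect)
    // and in at least one buffer (Union). Only the first aLen / bLen bytes
    // of each buffer are examined.
    public static (int Intersect, int Union) Count(byte[] a, int aLen, byte[] b, int bLen)
    {
        var inA = new bool[256]; // inA[v] is true if value v occurs in a
        var inB = new bool[256]; // inB[v] is true if value v occurs in b

        for (int i = 0; i < aLen; i++) inA[a[i]] = true;
        for (int i = 0; i < bLen; i++) inB[b[i]] = true;

        int intersect = 0, union = 0;
        for (int v = 0; v < 256; v++)
        {
            if (inA[v] && inB[v]) intersect++;
            if (inA[v] || inB[v]) union++;
        }
        return (intersect, union);
    }
}
```

In the question's loop this could replace the two LINQ calls, for example var (common, total) = ByteSetCounts.Count(src, srcRead, des, desRead); where srcRead and desRead stand for the values returned by the two Read calls.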

Answer 2

A few points: the comment // get file length is misleading, as the value is the buffer size. Counter is not the number of bytes read; it is the number of blocks read. data and data1 will end up with the result of the last block read, ignoring any data before it. That is assuming that nothing goes wrong in the while loop: you need to remove the try structure to see if there are any errors.

What you can do is count the number of occurrences of each byte in each file. Then, if the count of a byte in any file is greater than zero, it is a member of the union of the files, and if the count of a byte in all the files is greater than zero, it is a member of the intersection of the files.

It is just as easy to write the code for more than two files as it is for two files, whereas LINQ is easy for two but a little bit more fiddly for more than two. (At the end I put in a comparison with using LINQ in a naïve fashion for only two files.)

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {

            var file1 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\Crysis3.exe"; // 26MB
            var file2 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\d3dcompiler_46.dll"; // 3MB
            List<string> files = new List<string> { file1, file2 };

            var sw = System.Diagnostics.Stopwatch.StartNew();

            // Prepare array of counters for the bytes
            var nFiles = files.Count;
            int[][] count = new int[nFiles][];
            for (int i = 0; i < nFiles; i++)
            {
                count[i] = new int[256];
            }

            // Get the counts of bytes in each file
            int bufLen = 32768;
            byte[] buffer = new byte[bufLen];

            int bytesRead;

            for (int fileNum = 0; fileNum < nFiles; fileNum++)
            {
                using (var sr = new FileStream(files[fileNum], FileMode.Open, FileAccess.Read))
                {
                    bytesRead = bufLen;
                    while (bytesRead > 0)
                    {
                        bytesRead = sr.Read(buffer, 0, bufLen);
                        for (int i = 0; i < bytesRead; i++)
                        {
                            count[fileNum][buffer[i]]++;
                        }
                    }
                }
            }

            // Find which bytes are in any of the files or in all the files
            var inAny = new List<byte>(); // union
            var inAll = new List<byte>(); // intersect

            for (int i = 0; i < 256; i++)
            {
                Boolean all = true;
                for (int fileNum = 0; fileNum < nFiles; fileNum++)
                {
                    if (count[fileNum][i] > 0)
                    {
                        if (!inAny.Contains((byte)i)) // avoid adding same value more than once
                        {
                            inAny.Add((byte)i);
                        }
                    }
                    else
                    {
                        all = false;
                    }
                }

                if (all)
                {
                    inAll.Add((byte)i);
                }

            }

            sw.Stop();

            Console.WriteLine(sw.ElapsedMilliseconds);

            // Display the results
            Console.WriteLine("Union: " + string.Join(",", inAny.Select(x => x.ToString("X2"))));
            Console.WriteLine();
            Console.WriteLine("Intersect: " + string.Join(",", inAll.Select(x => x.ToString("X2"))));
            Console.WriteLine();

            // Compare to using LINQ.
            // N.B. Will need adjustments for more than two files.

            var srcBytes1 = File.ReadAllBytes(file1);
            var srcBytes2 = File.ReadAllBytes(file2);

            sw.Restart();

            var intersect = srcBytes1.Intersect(srcBytes2).ToArray().OrderBy(x => x);
            var union = srcBytes1.Union(srcBytes2).ToArray().OrderBy(x => x);

            Console.WriteLine(sw.ElapsedMilliseconds);

            Console.WriteLine("Union: " + String.Join(",", union.Select(x => x.ToString("X2"))));
            Console.WriteLine();
            Console.WriteLine("Intersect: " + String.Join(",", intersect.Select(x => x.ToString("X2"))));

            Console.ReadLine();

        }
    }
}

The counting-the-byte-occurrences method is roughly five times faster than the LINQ method on my computer, even though the LINQ timing excludes loading the files, and this holds over a range of file sizes (a few KB to a few MB).
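Since the question asks for counts rather than the members themselves, the same table of occurrences can be reduced to two numbers directly. The following is a small sketch under the assumption that the table has the same layout as the count array in the program above (one int[256] per file) and that there is at least one file; HistogramCounts and its methods are hypothetical names.

```csharp
using System;
using System.Linq;

static class HistogramCounts
{
    // counts[f][v] holds how many times byte value v occurs in file f.

    // Number of distinct byte values that occur in at least one file.
    public static int UnionCount(int[][] counts) =>
        Enumerable.Range(0, 256).Count(v => counts.Any(c => c[v] > 0));

    // Number of distinct byte values that occur in every file.
    public static int IntersectCount(int[][] counts) =>
        Enumerable.Range(0, 256).Count(v => counts.All(c => c[v] > 0));
}
```

LINQ is harmless here because it only ever walks 256 entries per file, regardless of how large the files are.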

Answer 1
1 Accepted 2015-11-28 09:04:17

Answer 2
1 2015-11-28 12:26:55
