Fast comparison of the contents of two huge text files

Question

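What I'm basically trying to do is compare two huge text files and, when a match is found, write out a string. I have this written already, but it is extremely slow. I was hoping you guys might have a better idea. In the example below I'm comparing splitcollect[3] with spltifound[0].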

        string[] collectionlist = File.ReadAllLines(@"C:\found.txt");
        string[] foundlist = File.ReadAllLines(@"C:\collection_export.txt");
        foreach (string found in foundlist)
        {
            string[] spltifound = found.Split('|');
            string matchfound = spltifound[0].Replace(".txt", "");
            foreach (string collect in collectionlist)
            {
                string[] splitcollect = collect.Split('\\');
                string matchcollect = splitcollect[3].Replace(".txt", "");
                if (matchcollect == matchfound)
                {
                    end++;
                    long finaldest = (start - end);
                    Console.WriteLine(finaldest);
                    File.AppendAllText(@"C:\copy.txt", "copy \"" + collect + "\" \"C:\\OUT\\" + spltifound[1] + "\\" + spltifound[0] + ".txt\"\n");
                    break;
                }
            }
        }

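Sorry for being vague, guys.

What I'm trying to do is simply this: if content from one file exists in the other, write out a string (the string itself isn't important; what matters is the time it takes to find the two matching items). The collection list looks like this:
apple|farm

The found list looks like this:
C:\cow\horse\turtle.txt
C:\cow\pig\apple.txt

What I'm doing is taking apple from the collection list and finding the line in the found list that contains apple, then writing out a basic Windows copy batch file. Sorry for the confusion.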

Answer (all credit to Slaks)

        string[] foundlist = File.ReadAllLines(@"C:\found.txt");
        var collection = File.ReadLines(@"C:\collection_export.txt")
            .ToDictionary(s => s.Split('|')[0].Replace(".txt", ""));

        using (var writer = new StreamWriter(@"C:\Copy.txt"))
        {
            foreach (string found in foundlist)
            {
                string matchFound = Path.GetFileNameWithoutExtension(found);

                string collectedLine;
                if (collection.TryGetValue(matchFound, out collectedLine))
                {
                    string[] collectlinesplit = collectedLine.Split('|');
                    end++;
                    long finaldest = (start - end);
                    Console.WriteLine(finaldest);
                    writer.WriteLine("copy \"" + found + "\" \"C:\\O\\" + collectlinesplit[1] + "\\" + collectlinesplit[0] + ".txt\"");
                }
            }
        }

Answer 1

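Call File.ReadLines() (.NET 4) instead of ReadAllLines() (.NET 2.0).
ReadAllLines needs to build an array to hold the return value, which can be extremely slow for large files.
If you aren't using .NET 4.0, replace it with a StreamReader.

Build a Dictionary<string, string> keyed by the matchCollect values (once), then loop through foundList and check whether the dictionary contains matchFound.
This lets you replace the O(n) inner loop with an O(1) hash lookup.

Use a StreamWriter instead of calling File.AppendAllText for every line.

EDIT: Call Path.GetFileNameWithoutExtension and the other Path methods instead of manipulating the strings by hand.

For example: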

// found.txt holds the full paths; index them by bare file name, once.
var collection = File.ReadLines(@"C:\found.txt")
    .ToDictionary(s => s.Split('\\')[3].Replace(".txt", ""));

// collection_export.txt holds the '|'-separated records from the question.
var foundlist = File.ReadLines(@"C:\collection_export.txt");

using (var writer = new StreamWriter(@"C:\Copy.txt")) {
    foreach (string found in foundlist) {
        string[] splitFound = found.Split('|');
        string matchFound = Path.GetFileNameWithoutExtension(splitFound[0]);

        string collectedLine;
        if (collection.TryGetValue(matchFound, out collectedLine)) {
            end++;
            long finaldest = (start - end);
            Console.WriteLine(finaldest);
            writer.WriteLine("copy \"" + collectedLine + "\" \"C:\\OUT\\"
                           + splitFound[1] + "\\" + splitFound[0] + ".txt\"");
        }
    }
}
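
For readers on a framework older than .NET 4, the StreamReader fallback mentioned above could look roughly like the sketch below. It is only an illustration: ReadLinesLazily is a hypothetical helper name, not part of the framework, and the demo is written as a top-level program (C# 9 or later) purely to keep it short. The helper body itself uses nothing newer than iterators.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: yields one line at a time instead of building a full array,
// which is roughly what File.ReadLines does on .NET 4 and later.
IEnumerable<string> ReadLinesLazily(TextReader reader)
{
    string line;
    while ((line = reader.ReadLine()) != null)
        yield return line;
}

// Demo with in-memory text; for a real file, wrap a StreamReader in a using block
// and pass it in place of the StringReader.
var collected = new List<string>(ReadLinesLazily(new StringReader("a\nb\nc")));
Console.WriteLine(collected.Count);
```

Because the lines are produced on demand, the caller can start matching before the whole file has been read.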

Answer 2

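First, I'd suggest normalizing both files and then putting one of them into a set. This lets you test quickly whether a particular line is present and reduces the complexity from O(n * n) to O(n).

Also, you shouldn't open and close the file every time you write a line:

File.AppendAllText(...); // This causes the file to be opened and closed.

Open the output file once at the start of the operation, write the lines to it, and close it once all the lines have been written.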
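
A minimal sketch of this idea follows, under some assumptions: "normalizing" is taken to mean reducing each line to a lower-cased name without the .txt suffix, and KeyFromPath, KeyFromRecord, StripTxt and WriteMatches are hypothetical names introduced only for illustration. It is written as a top-level program (C# 9 or later) and splits on the backslash by hand so that it behaves the same on any operating system.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical normalizers: both reduce a line to a bare, lower-cased name.
string StripTxt(string name)
{
    name = name.ToLowerInvariant();
    return name.EndsWith(".txt", StringComparison.Ordinal)
        ? name.Substring(0, name.Length - 4)
        : name;
}

string KeyFromPath(string path) =>
    StripTxt(path.Substring(path.LastIndexOf('\\') + 1).Trim());

string KeyFromRecord(string record) =>
    StripTxt(record.Split('|')[0].Trim());

// Writes every path whose normalized name appears among the records.
// The writer is supplied by the caller, so it is opened and closed only once.
int WriteMatches(IEnumerable<string> paths, IEnumerable<string> records, TextWriter output)
{
    var keys = new HashSet<string>(records.Select(KeyFromRecord)); // built once
    int count = 0;
    foreach (string path in paths)
    {
        if (keys.Contains(KeyFromPath(path))) // O(1) on average
        {
            output.WriteLine(path);
            count++;
        }
    }
    return count;
}

// Demo with in-memory data; for real files, pass File.ReadLines(...) results
// and a StreamWriter created in a single using block.
var sink = new StringWriter();
int matched = WriteMatches(
    new[] { @"C:\cow\horse\turtle.txt", @"C:\cow\pig\apple.txt" },
    new[] { "apple|farm" },
    sink);
Console.WriteLine(matched);
```

Taking a TextWriter rather than a file name keeps the open-once decision with the caller and makes the matching logic easy to exercise without touching the disk.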

Answer 3

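You have a Cartesian product, so it makes sense to index one side instead of doing an exhaustive linear search.

Extract the keys from one file and hold them in either a Set or a SortedList data structure. That will make the lookups much faster. (Your overall algorithm will be O(N lg N) instead of O(N ** 2).)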
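
One way to picture the sorted variant is sketched below. Note the substitution: instead of a SortedList it sorts a plain array once and probes it with Array.BinarySearch, which gives the same O(N lg N) overall cost. FindMatches and its keyOf parameter are hypothetical names used only for this illustration, and the sketch is written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Returns the probe lines whose extracted key occurs among the indexed keys.
List<string> FindMatches(IEnumerable<string> keys, IEnumerable<string> probes, Func<string, string> keyOf)
{
    string[] index = keys.ToArray();
    Array.Sort(index, StringComparer.Ordinal); // O(N log N), done once

    var hits = new List<string>();
    foreach (string probe in probes)
    {
        // O(log N) per lookup; a non-negative result means the key was found.
        if (Array.BinarySearch(index, keyOf(probe), StringComparer.Ordinal) >= 0)
            hits.Add(probe);
    }
    return hits;
}

// Demo with in-memory data: keys from one file, '|'-separated records from the other.
var matches = FindMatches(
    new[] { "turtle", "apple", "goat" },
    new[] { "apple|farm", "zebra|zoo" },
    line => line.Split('|')[0]);
Console.WriteLine(string.Join(", ", matches));
```

A hash-based set usually does fewer comparisons per lookup; the sorted form mainly pays off when ordered traversal or range queries are also needed.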
