
Comparing the contents of two huge text files quickly

What I'm basically trying to do is compare two HUGE text files and, if they match, write out a string. I have this written, but it's extremely slow. I was hoping you guys might have a better idea. In the example below I'm comparing splitcollect[3] with spltifound[0]:

        string[] collectionlist = File.ReadAllLines(@"C:\found.txt");
        string[] foundlist = File.ReadAllLines(@"C:\collection_export.txt");
        foreach (string found in foundlist)
        {
            string[] spltifound = found.Split('|');
            string matchfound = spltifound[0].Replace(".txt", "");
            foreach (string collect in collectionlist)
            {
                string[] splitcollect = collect.Split('\\');
                string matchcollect = splitcollect[3].Replace(".txt", "");
                if (matchcollect == matchfound)
                {
                    end++;
                    long finaldest = (start - end);
                    Console.WriteLine(finaldest);
                    File.AppendAllText(@"C:\copy.txt", "copy \"" + collect + "\" \"C:\\OUT\\" + spltifound[1] + "\\" + spltifound[0] + ".txt\"\n");
                    break;
                }
            }
        }

Sorry for the vagueness, guys.

What I'm trying to do is simply this: if content from one file exists in the other, write out a string (the string itself isn't important; what matters is the time it takes to find the matching pair). collectionlist is like this:
Apple|Farm

foundlist is like this:
C:\\cow\\horse\\turtle.txt
C:\\cow\\pig\\apple.txt

What I'm doing is taking apple from collectionlist and finding the line that contains apple in foundlist, then writing out a basic Windows copy batch file. Sorry for the confusion.

Answer (all credit to Slaks):

        string[] foundlist = File.ReadAllLines(@"C:\found.txt");
        var collection = File.ReadLines(@"C:\collection_export.txt")
            .ToDictionary(s => s.Split('|')[0].Replace(".txt", ""));

        using (var writer = new StreamWriter(@"C:\Copy.txt"))
        {
            foreach (string found in foundlist)
            {
                string[] splitFound = found.Split('\\');
                string matchFound = Path.GetFileNameWithoutExtension(found);

                string collectedLine;
                if (collection.TryGetValue(matchFound, out collectedLine))
                {
                    string[] collectlinesplit = collectedLine.Split('|');
                    end++;
                    long finaldest = (start - end);
                    Console.WriteLine(finaldest);
                    writer.WriteLine("copy \"" + found + "\" \"C:\\O\\" + collectlinesplit[1] + "\\" + collectlinesplit[0] + ".txt\"");
                }
            }
        }
  • Call File.ReadLines() (.NET 4) instead of ReadAllLines() (.NET 2.0).
    ReadAllLines needs to build an array to hold the return value, which can be extremely slow for large files.
    If you're not using .NET 4.0, replace it with a StreamReader.

  • Build a Dictionary<string, string> keyed by the matchCollect values (once), then loop through foundList and check whether the dictionary contains matchFound.
    This allows you to replace the O(n) inner loop with an O(1) hash check.

  • Use a StreamWriter instead of calling AppendAllText.

  • EDIT: Call Path.GetFileNameWithoutExtension and the other Path methods instead of manually manipulating strings.

For example:

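// foundlist holds the pipe-delimited lines, as in the question's code.
string[] foundlist = File.ReadAllLines(@"C:\collection_export.txt");

// Index the path lines once, keyed by the fourth path segment without ".txt".
var collection = File.ReadLines(@"C:\found.txt")
    .ToDictionary(s => s.Split('\\')[3].Replace(".txt", ""));

using (var writer = new StreamWriter(@"C:\Copy.txt")) {
    foreach (string found in foundlist) {
        string[] splitFound = found.Split('|');
        // Take the name from the first field only; the full line contains '|'.
        string matchFound = Path.GetFileNameWithoutExtension(splitFound[0]);

        string collectedLine;
        if (collection.TryGetValue(matchFound, out collectedLine)) {
            end++;  // start and end are counters declared elsewhere
            long finaldest = (start - end);
            Console.WriteLine(finaldest);
            writer.WriteLine("copy \"" + collectedLine + "\" \"C:\\OUT\\"
                           + splitFound[1] + "\\" + splitFound[0] + ".txt\"");
        }
    }
}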

First, I'd suggest normalizing both files and putting one of them in a set. This lets you quickly test whether a specific line is present and reduces the complexity from O(n*n) to O(n).
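A rough sketch of what "normalize, then put one side in a set" could look like. KeyMatcher, Normalize and FindMatches are made-up names for illustration, and the normalization rules (first '|' field, last path segment, drop ".txt", lower-case) are my assumption about the asker's data, not something the question confirms:

```csharp
using System;
using System.Collections.Generic;

static class KeyMatcher
{
    // Hypothetical helper: reduce either kind of line to a comparable key.
    // Assumed rules: first '|' field, last '\' segment, no ".txt", lower-case.
    public static string Normalize(string line)
    {
        string field = line.Split('|')[0].Trim();
        int slash = field.LastIndexOf('\\');
        string name = slash >= 0 ? field.Substring(slash + 1) : field;
        if (name.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
            name = name.Substring(0, name.Length - 4);
        return name.ToLowerInvariant();
    }

    // Returns the lines of `paths` whose key also occurs in `collection`.
    public static List<string> FindMatches(IEnumerable<string> collection,
                                           IEnumerable<string> paths)
    {
        // One pass to build the set, then one hash lookup per path line.
        var keys = new HashSet<string>();
        foreach (string line in collection)
            keys.Add(Normalize(line));

        var matches = new List<string>();
        foreach (string path in paths)
            if (keys.Contains(Normalize(path)))
                matches.Add(path);
        return matches;
    }
}
```

With the sample lines from the question, "Apple|Farm" and "C:\cow\pig\apple.txt" both normalize to "apple", so the path line is reported as a match. Both arguments can be fed straight from File.ReadLines so neither file has to be held as an array.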

Also, you shouldn't open and close the file every time you write a line:

File.AppendAllText(...); // This causes the file to be opened and closed.

Open the output file once at the start of the operation, write lines to it, then close it when all lines have been written.
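Something along these lines should do it. BatchWriter and WriteAll are hypothetical names, and this is only a sketch of the open-once pattern, not the asker's actual code:

```csharp
using System.Collections.Generic;
using System.IO;

static class BatchWriter
{
    // Opens the output file once, writes every command through the same
    // StreamWriter, and closes the file when the using block ends.
    public static void WriteAll(string outputPath, IEnumerable<string> commands)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string command in commands)
                writer.WriteLine(command);
        }
    }
}
```

One difference to be aware of: new StreamWriter(path) overwrites an existing file, whereas File.AppendAllText adds to it. If appending is what you want, the StreamWriter(path, true) overload opens the file in append mode.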

You have a Cartesian product, so it makes sense to index one side instead of doing an exhaustive linear search.

Extract the keys from one file and use either a Set or SortedList data structure to hold them. This will make the lookups much faster. (Your overall algorithm will be O(N lg N) instead of O(N**2).)
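To illustrate the sorted variant, here is a minimal sketch that swaps in a plain sorted array with Array.BinarySearch rather than a SortedList; the idea (sort once, then O(lg N) per lookup) is the same. SortedLookup, BuildIndex and Contains are names I made up, and the case-insensitive comparer is an assumption about how the keys should match:

```csharp
using System;
using System.Collections.Generic;

static class SortedLookup
{
    // Sort the extracted keys once: O(N lg N).
    public static string[] BuildIndex(IEnumerable<string> keys)
    {
        string[] index = new List<string>(keys).ToArray();
        Array.Sort(index, StringComparer.OrdinalIgnoreCase);
        return index;
    }

    // Each probe is a binary search over the sorted keys: O(lg N).
    public static bool Contains(string[] index, string key)
    {
        return Array.BinarySearch(index, key, StringComparer.OrdinalIgnoreCase) >= 0;
    }
}
```

The same comparer has to be used for sorting and for searching, otherwise the binary search can miss keys that are present. If ordering doesn't matter to you, a hash-based set gives O(1) lookups and is usually the simpler choice.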


 