
How to loop through and compare millions of values in two text files?

I have two text files (TXT), each containing over 2 million distinct file names. I want to loop through all the names in the first file and find those that are also present in the second file.

I have tried looping through a StreamReader, but it takes a lot of time. I also tried the code below, but it still takes too much time.

StreamReader first = new StreamReader(path);
string strFirst = first.ReadToEnd();
string[] strarrFirst = strFirst.Split('\n');

bool found = false;

StreamReader second = new StreamReader(path2);
string str = second.ReadToEnd();
string[] strarrSecond = str.Split('\n');

for (int j = 0; j < (strarrFirst.Length); j++)
{
    found = false;

    for (int i = 0; i < strarrSecond.Length; i++)
    {
        if (strarrFirst[j] == strarrSecond[i])
        {
            found = true;
            break;
        }
    }

    if (!found)
    {
        Console.WriteLine(strarrFirst[j]);
    }
}

What is a good way to compare the files?

How about this:

var commonNames = File.ReadLines(path).Intersect(File.ReadLines(path2));

That's O(N + M), whereas your current solution tests every line in the first file against every line in the second file, which is O(N * M).
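To make the complexity difference concrete, here is a minimal sketch of the hash-set idea behind a set intersection: one pass over the second sequence to build a lookup set, then one pass over the first. It is not the actual implementation of Intersect, only an illustration of the same technique. The names LineSetOps and CommonLines are hypothetical, the comparison is assumed to be exact (ordinal), and HashSet<T> requires .NET 3.5 or later.

```csharp
using System;
using System.Collections.Generic;

public static class LineSetOps
{
    // Hypothetical helper: returns the distinct lines of `first`
    // that also occur in `second`, in the order they appear in `first`.
    public static List<string> CommonLines(IEnumerable<string> first,
                                           IEnumerable<string> second)
    {
        // One pass over `second` to build the lookup set: O(M).
        var remaining = new HashSet<string>(second, StringComparer.Ordinal);
        var common = new List<string>();

        // One pass over `first`: N lookups, each O(1) on average.
        foreach (string line in first)
        {
            // Remove returns true only the first time a matching line is
            // seen, so a line repeated in `first` is reported once.
            if (remaining.Remove(line))
            {
                common.Add(line);
            }
        }

        return common;
    }
}
```

Something like LineSetOps.CommonLines(File.ReadLines(path), File.ReadLines(path2)) would then keep only the second file's names in memory while the first file is streamed.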

That's assuming you're using .NET 4. Otherwise, you could use File.ReadAllLines, but that will read the whole file into memory. Or you could write the equivalent of File.ReadLines yourself - it's not terribly hard.

Ultimately, once you've got rid of the O(N * M) problem in your current code, you're likely to be limited by file IO - there's not much you can do to get round that.

EDIT: For .NET 2, first let's implement something like ReadLines:

public static IEnumerable<string> ReadLines(string file)
{
    using (TextReader reader = File.OpenText(file))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

Now we really want to use a HashSet<T>, but that wasn't in .NET 2 - so let's use Dictionary<TKey, TValue> instead:

Dictionary<string, string> map = new Dictionary<string, string>();
foreach (string line in ReadLines(path))
{
    map[line] = line;
}

List<string> intersection = new List<string>();
foreach (string line in ReadLines(path2))
{
    if (map.ContainsKey(line))
    {
        intersection.Add(line);
    }
}

Try something like this to speed it up a bit:

            var path = string.Empty;   // placeholder: set to the first file's path
            var path2 = string.Empty;  // placeholder: set to the second file's path
            var strFirst = string.Empty;
            var str = string.Empty;
            var strarrFirst = new List<string>();
            var strarrSecond = new List<string>();

            using (var first = new StreamReader(path))
            {
                strFirst = first.ReadToEnd();
            }

            using (var second = new StreamReader(path2))
            {
                str = second.ReadToEnd();
            }


            strarrFirst.AddRange(strFirst.Split('\n'));

            strarrSecond.AddRange(str.Split('\n'));
            strarrSecond.Sort();

            foreach(var value in strarrFirst)
            {
                var found = strarrSecond.BinarySearch(value) >= 0;
                if (!found) Console.WriteLine(value);
            }
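Two caveats with the sort-and-search approach are worth illustrating. Sort() and BinarySearch() without a comparer use the default string comparison, which is culture-sensitive and usually slower than an ordinal comparison, and splitting on '\n' alone leaves a trailing '\r' on every name if the files use Windows line endings. The following is a sketch under the assumption that names should match exactly, character for character; SortedLookup and MissingFrom are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SortedLookup
{
    // Hypothetical helper: returns the entries of the first text that do
    // not occur in the second, i.e. what the loop above prints.
    public static List<string> MissingFrom(string firstText, string secondText)
    {
        List<string> first = SplitLines(firstText);
        List<string> second = SplitLines(secondText);

        // Sort and search must use the same comparer; ordinal is assumed here.
        second.Sort(StringComparer.Ordinal);

        var missing = new List<string>();
        foreach (string value in first)
        {
            // BinarySearch returns a negative number when the value is absent.
            if (second.BinarySearch(value, StringComparer.Ordinal) < 0)
            {
                missing.Add(value);
            }
        }

        return missing;
    }

    // Splits on both Windows and Unix line endings and drops empty entries,
    // so no name carries a stray '\r' and a trailing newline adds nothing.
    private static List<string> SplitLines(string text)
    {
        string[] parts = text.Split(new[] { "\r\n", "\n" },
                                    StringSplitOptions.RemoveEmptyEntries);
        return new List<string>(parts);
    }
}
```

Each lookup is still O(log M) after an O(M log M) sort, so a hash-based lookup is generally expected to do less work on inputs of this size.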

Just for fun, I benchmarked Jon Skeet's method against my own:

        var guidArray = Enumerable.Range(0, 1000000).Select(x => Guid.NewGuid().ToString()).ToList();
        string path = "first.txt";
        File.WriteAllLines(path, guidArray);
        string path2 = "second.txt";
        File.WriteAllLines(path2, guidArray.Select(x=>DateTime.UtcNow.Ticks % 2 == 0 ? x : Guid.NewGuid().ToString()));

        var start = DateTime.Now;

        var commonNames = File.ReadLines(path).Intersect(File.ReadLines(path2)).ToList();

        Console.WriteLine((DateTime.Now - start).TotalMilliseconds);

        start = DateTime.Now;

        var lines = File.ReadAllLines(path);
        var hashset = new HashSet<string>(lines);

        var lines2 = File.ReadAllLines(path2);

        var result = lines2.Where(hashset.Contains).ToList();

        Console.WriteLine((DateTime.Now - start).TotalMilliseconds);
        Console.ReadKey();

Skeet's method was a tiny bit faster (1453.0831 ms vs 1488.0851 ms for mine; iDevForFun's method was quite slow at 12791.7316 ms), so I think that under the hood it does much the same thing I was doing manually with a HashSet.
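A note on the measurement itself: the timings above come from subtracting DateTime.Now values, while System.Diagnostics.Stopwatch is the class intended for measuring elapsed time. A difference as small as the one between the first two numbers may also be affected by run order, since the second approach reads files the first one has just touched. Here is a minimal sketch of a timing helper, with Timing and MeasureMilliseconds as hypothetical names:

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs `work` once and returns the elapsed
    // wall-clock time in milliseconds.
    public static double MeasureMilliseconds(Action work)
    {
        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds;
    }
}
```

It could be called as Timing.MeasureMilliseconds(() => File.ReadLines(path).Intersect(File.ReadLines(path2)).ToList()), ideally several times per approach and in alternating order before comparing the results.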
