简体   繁体   中英

Remove items from one list if they contain strings from another list

I'm looking for the most efficient way to remove items from one list if they contain strings from another list.

For example:

B list contains:

TomWentFishing
SueStayedHome
JohnGoesToSchool
JimPlaysTennis

A list contains:

GoesToSchool
SueStayed

C list should contain:

TomWentFishing
JimPlaysTennis

I've used this code, but it takes up a lot of time as the lists are very large:

static void Main(string[] args)
    {
        string[] b = File.ReadAllLines(@"C:\b.txt");
        string[] a = File.ReadAllLines(@"C:\a.txt");

        foreach (string firststring in b)
        {
            bool contains = false;
            foreach (string secondstring in a)
            {
                if (firststring.ToLower().Contains(secondstring.ToLower()))
                {
                    contains = true;
                    break;
                }
            }

            if (contains == false)
            {
                File.AppendAllText(@"C:\c.txt", firststring + Environment.NewLine);
            }


        }

    }

You can make this significantly faster if you can sort the a list into something that can support binary (or faster) lookups.

Unfortunately, the Contains() search makes this challenging. But there are still some things we can do:

  • Avoid loading all of b into RAM. Ever.
  • On the other hand, lookups into a will be faster if we preload into RAM once , and do as much work to support the lookups for this one copy as we can.
  • Only convert b to lower case once, instead of again for each line in a .
  • It will be more efficient to do all of the write operations at once, rather than re-opening the output file to append the lines as we find them.
  • As a bonus, we'll do all this in significantly less code.
static void Main(string[] args)
{
    var b = File.ReadLines(@"C:\b.txt");
    var a = File.ReadLines(@"C:\a.txt").Select(line => line.ToLower()).ToList();

    var result = b.Where(bline => {
       var lowered = bline.ToLower();
       return !a.Any(aline => lowered.Contains(aline));
    });

    File.AppendAllLines(@"C:\c.txt", result);
}

Here you have a very efficient implementation based on hash set, it is linear time complexity O(n). This avoids you to iterate over all lines of a.txt file for each line in b.txt file which leads to quadratic time complexity O(n^2).

This approach is good if a hash set with all the lines of a.txt files fits in memory. It it doesn't fit in memory then you need to use something like RocksDb.

First you have this extension method:

public static class EnumerableStringExtensions
{
    public static IEnumerable<string> Minus(
        this IEnumerable<string> minuend, 
        IEnumerable<string> subtrahend, 
        StringComparison comparisonType)
    {
        var subtrahendSet = new HashSet<string>(subtrahend, StringComparer.FromComparison(comparisonType));
        return minuend.Where(x => subtrahendSet.Contains(x) == false);
    }
}

And you can use it like this:

public class Program
{
    public static IEnumerable<string> EnumerateLines(string filePath)
    {
        using (var reader = File.OpenText(filePath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    static void Main(string[] args)
    {
        var minuend = EnumerateLines("b.txt");
        var sustraend = EnumerateLines("a.txt");
        var difference = minuend.Minus(sustraend, StringComparison.OrdinalIgnoreCase);
        File.WriteAllLines("difference.txt", difference);

    }
}

Note that with this implementation you don't need to keep all lines from b.txt file in memory at once. But you need a hash set with all lines from a.txt

If the issue is high memory usage because of large file size, then one file you have read anyway but for other one instead of reading the entire file in memory directly, you can use FileInputStream with BufferedReader to read lines one by one. This will reduce some memory usage

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM