
C# Fastest way to intersect lists of strings

I am using HashSet, LINQ Intersect() and Count() to find the intersection of two lists of strings.

Code being used

private HashSet<string> Words { get; }

public Sentence(IEnumerable<string> words)
{
    Words = words.ToHashSet();
}

public int GetSameWordCount(Sentence sentence)
{
    return Words.Intersect(sentence.Words).Count();
}

The GetSameWordCount method takes more than 90% of the program's runtime, as there are millions of sentences to compare with each other.

Is there any faster way to do this?

I am using .NET Core 3.1.1 / C# 8, so any recent features can be used.

More info:
Input data comes from a text file (e.g. book excerpts, articles from the web). Sentences are then unaccented, lowercased and split into words by a whitespace regex. Short words (fewer than 3 characters) are ignored.
I am creating groups of sentences which have N words in common and ordering these groups by the number of shared words.

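The code below calls HashSet<T>.Contains directly, which should be faster: Enumerable.Intersect builds a new temporary set from its second argument on every call, whereas HashSet<T>.Contains is an O(1) lookup (on average) against the set that already exists.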

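public int GetSameWordCount(Sentence sentence)
{
    // The counter must be initialized; "var count;" does not compile.
    var count = 0;
    foreach (var word in sentence.Words)
    {
        // O(1) average lookup in this sentence's own set.
        if (Words.Contains(word))
            count++;
    }
    return count;
}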
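
A possible refinement, not part of the code above: the loop performs one lookup per word in the set being enumerated, so enumerating the smaller of the two sets keeps the work at O(min(n, m)) lookups. A minimal sketch, assuming a hypothetical static helper named CountCommon and two sets that use the same (default) string comparer:

```csharp
using System;
using System.Collections.Generic;

public static class WordSets
{
    // Hypothetical helper (not from the original post): counts the words
    // present in both sets by enumerating the smaller set and probing the larger.
    public static int CountCommon(HashSet<string> a, HashSet<string> b)
    {
        // Swap so that 'a' always refers to the smaller set.
        if (a.Count > b.Count)
        {
            var tmp = a;
            a = b;
            b = tmp;
        }

        var count = 0;
        foreach (var word in a)
        {
            if (b.Contains(word))
                count++;
        }
        return count;
    }
}
```

For example, calling WordSets.CountCommon on { "quick", "brown", "fox" } and { "lazy", "brown", "dog", "fox", "jumps" } returns 2. Whether this helps in practice depends on how much the sentence lengths differ, so it is worth measuring.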

Note

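If both word collections are kept in sorted order (for example, as SortedSet<string> or as sorted lists), you can instead count the matches in a single merge-style pass. Here set1 and set2 are the two sorted collections, and both must be sorted with the same ordering that CompareTo uses.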

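var enumerator1 = set1.GetEnumerator();
var enumerator2 = set2.GetEnumerator();
var count = 0;
// Start only if both collections have at least one element.
if (enumerator1.MoveNext() && enumerator2.MoveNext())
{
    while (true)
    {
        var value = enumerator1.Current.CompareTo(enumerator2.Current);
        if (value == 0)
        {
            // Same word in both collections: count it and advance both.
            count++;
            if (!enumerator1.MoveNext() || !enumerator2.MoveNext())
                break;
        }
        else if (value < 0)
        {
            // The word from set1 sorts first: advance set1.
            if (!enumerator1.MoveNext())
                break;
        }
        else
        {
            // The word from set2 sorts first: advance set2.
            if (!enumerator2.MoveNext())
                break;
        }
    }
}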
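
The merge loop above can be wrapped in a reusable method. This is a sketch under stated assumptions: CountCommonSorted is a hypothetical name, both inputs are sorted ascending without duplicates, and both were sorted with ordinal comparison, which is why the sketch calls string.CompareOrdinal rather than the culture-sensitive CompareTo. The sort order and the comparison used in the loop have to agree, otherwise matches can be skipped.

```csharp
using System;
using System.Collections.Generic;

public static class SortedWords
{
    // Hypothetical helper: counts strings that occur in both sequences.
    // Assumes both are sorted ascending by ordinal comparison, with no duplicates.
    public static int CountCommonSorted(IEnumerable<string> sorted1, IEnumerable<string> sorted2)
    {
        using var e1 = sorted1.GetEnumerator();
        using var e2 = sorted2.GetEnumerator();

        var count = 0;
        // Nothing to compare if either sequence is empty.
        if (!e1.MoveNext() || !e2.MoveNext())
            return count;

        while (true)
        {
            var cmp = string.CompareOrdinal(e1.Current, e2.Current);
            if (cmp == 0)
            {
                // Match: count it and advance both sequences.
                count++;
                if (!e1.MoveNext() || !e2.MoveNext())
                    break;
            }
            else if (cmp < 0)
            {
                // Current word of the first sequence sorts first: advance it.
                if (!e1.MoveNext())
                    break;
            }
            else
            {
                // Current word of the second sequence sorts first: advance it.
                if (!e2.MoveNext())
                    break;
            }
        }
        return count;
    }
}
```

With two SortedSet<string> instances created with StringComparer.Ordinal, one holding { "brown", "fox", "quick" } and the other { "brown", "dog", "fox", "jumps", "lazy" }, CountCommonSorted returns 2. The pass costs O(n + m) comparisons, so it is not automatically faster than hash lookups; it mainly avoids hashing and can be friendlier to memory when the words are stored in sorted arrays.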
