I am using a HashSet<string> with LINQ's Intersect() and Count() to find how many words two lists of strings have in common.
Code being used:
private HashSet<string> Words { get; }

public Sentence(IEnumerable<string> words)
{
    Words = words.ToHashSet();
}

public int GetSameWordCount(Sentence sentence)
{
    return Words.Intersect(sentence.Words).Count();
}
The GetSameWordCount method takes more than 90% of the program's runtime, because there are millions of sentences to compare with each other.
Is there any faster way to do this?
I am using .NET Core 3.1.1 / C# 8, so any recent features can be used.
More info:
Input data comes from a text file (e.g. a book excerpt or articles from the web). Sentences are then unaccented, lowercased and split into words by a whitespace regex. Short words (fewer than 3 characters) are ignored.
I am creating groups of sentences which have N words in common and ordering these groups by the number of shared words.
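The code below calls HashSet<T>.Contains directly, which is more performant. HashSet<T>.Contains is O(1) on average, so the loop costs time proportional to the number of words in the set being enumerated. Unlike Intersect(), it also does not build a temporary set on every call.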
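public int GetSameWordCount(Sentence sentence)
{
    // A local declared with var must be initialized.
    var count = 0;
    foreach (var word in sentence.Words)
    {
        // Average O(1) lookup in this sentence's set.
        if (Words.Contains(word))
            count++;
    }
    return count;
}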
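As a variation on the loop above, here is a minimal sketch that enumerates whichever set is smaller and probes the larger one, so the number of lookups is bounded by the smaller word count. The class layout, the explicit StringComparer.Ordinal, the parameter name other and the sample words are assumptions made for illustration. Whether this beats the plain loop on real data should be measured.

```csharp
using System;
using System.Collections.Generic;

public sealed class Sentence
{
    private HashSet<string> Words { get; }

    public Sentence(IEnumerable<string> words)
    {
        // Ordinal comparison matches the default string equality
        // and makes the choice explicit.
        Words = new HashSet<string>(words, StringComparer.Ordinal);
    }

    public int GetSameWordCount(Sentence other)
    {
        // Enumerate the smaller set and probe the larger one.
        var small = Words;
        var large = other.Words;
        if (small.Count > large.Count)
        {
            (small, large) = (large, small);
        }

        var count = 0;
        foreach (var word in small)
        {
            if (large.Contains(word))
                count++;
        }
        return count;
    }
}

public static class Program
{
    public static void Main()
    {
        var a = new Sentence(new[] { "quick", "brown", "fox", "jumps" });
        var b = new Sentence(new[] { "lazy", "brown", "dog", "jumps", "over" });
        Console.WriteLine(a.GetSameWordCount(b)); // prints 2 ("brown" and "jumps")
    }
}
```

The result is symmetric, so swapping the two sets does not change the count, only how many Contains calls are made.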
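Note:
If the word lists are sorted, you can use the approach below. It walks both sequences once, always advancing the side that holds the smaller current value. Here set1 and set2 must enumerate in sorted order (for example, SortedSet<string>), using the same ordering that CompareTo applies; a HashSet<string> does not guarantee any order.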
var enumerator1 = set1.GetEnumerator();
var enumerator2 = set2.GetEnumerator();
var count = 0;
// Both sequences need at least one element to start the merge.
if (enumerator1.MoveNext() && enumerator2.MoveNext())
{
    while (true)
    {
        var value = enumerator1.Current.CompareTo(enumerator2.Current);
        if (value == 0)
        {
            // Same word on both sides: count it and advance both.
            count++;
            if (!enumerator1.MoveNext() || !enumerator2.MoveNext())
                break;
        }
        else if (value < 0)
        {
            // The first side is behind: advance it.
            if (!enumerator1.MoveNext())
                break;
        }
        else
        {
            // The second side is behind: advance it.
            if (!enumerator2.MoveNext())
                break;
        }
    }
}
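The same merge can be sketched over pre-sorted arrays, which avoids the enumerator calls. This is an illustration under stated assumptions, not a measured improvement: the class name SortedIntersection, the method name CountCommon and the sample words are hypothetical, each array is assumed to contain no duplicates, and both arrays are assumed to be sorted once up front with StringComparer.Ordinal so that string.CompareOrdinal sees a consistent order.

```csharp
using System;

public static class SortedIntersection
{
    // Counts the words present in both arrays.
    // Assumes each array is sorted with StringComparer.Ordinal
    // and contains no duplicates.
    public static int CountCommon(string[] sortedA, string[] sortedB)
    {
        int i = 0, j = 0, count = 0;
        while (i < sortedA.Length && j < sortedB.Length)
        {
            int cmp = string.CompareOrdinal(sortedA[i], sortedB[j]);
            if (cmp == 0)
            {
                // Same word on both sides: count it and advance both.
                count++;
                i++;
                j++;
            }
            else if (cmp < 0)
            {
                i++; // sortedA is behind
            }
            else
            {
                j++; // sortedB is behind
            }
        }
        return count;
    }

    public static void Main()
    {
        var a = new[] { "quick", "brown", "fox", "jumps" };
        var b = new[] { "lazy", "brown", "dog", "jumps", "over" };

        // Sort once, when the sentence is created, not on every comparison.
        Array.Sort(a, StringComparer.Ordinal);
        Array.Sort(b, StringComparer.Ordinal);

        Console.WriteLine(CountCommon(a, b)); // prints 2 ("brown" and "jumps")
    }
}
```

Ordinal comparison is used here because the words are already unaccented and lowercased, so culture-aware ordering is not needed for the count to be correct.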