C# - Fuzzy compare of two large string arrays

Question

I need to find all strings in B that "partly" exist in A.

B = [ "Hello World!", "Hello Stack Overflow!", "Foo Bar!", "Food is nice...", "Hej" ]
A = [ "World", "Foo" ]
C = B.FuzzyCompare(A) // C = [ "Hello World!", "Foo Bar!", "Food is nice..." ]

I've been looking into using the Levenshtein distance algorithm for the "fuzzy" part of the problem, as well as LINQ for the iterations. However, comparing every element of A with every element of B usually results in over 1.5 billion comparisons.

How should I go about this? Is there a way to quickly "almost compare" two lists of strings?

Answer 1

Maybe simply comparing substrings is enough, which would be more efficient:

var C = B.Where(s1 => A.Any(s2 => s1.IndexOf(s2, StringComparison.OrdinalIgnoreCase) >= 0)).ToList();
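For anyone who wants to try the one-liner above against the sample data from the question, here is a minimal self-contained sketch. The class name `SubstringFilter` and the method name `FuzzyCompare` are only illustrative (the question uses `FuzzyCompare` as pseudocode); the filtering expression itself is unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubstringFilter
{
    // Returns every string in b that contains at least one string from a,
    // ignoring case. Name and signature are illustrative assumptions.
    public static List<string> FuzzyCompare(IEnumerable<string> b, IEnumerable<string> a)
    {
        return b.Where(s1 => a.Any(s2 => s1.IndexOf(s2, StringComparison.OrdinalIgnoreCase) >= 0))
                .ToList();
    }

    public static void Main()
    {
        var B = new[] { "Hello World!", "Hello Stack Overflow!", "Foo Bar!", "Food is nice...", "Hej" };
        var A = new[] { "World", "Foo" };

        List<string> C = FuzzyCompare(B, A);

        Console.WriteLine(string.Join(", ", C));
        // prints: Hello World!, Foo Bar!, Food is nice...
    }
}
```

This is still a pairwise scan, so it does not reduce the number of comparisons in the worst case; it only makes each comparison a cheap substring check instead of an edit-distance computation.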

Answer 2

This seems like a good use of a Suffix Trie.

A Suffix Trie is a tree with no payload. It indexes all suffixes of a given string or sentence so that they can be searched in O(n) time. So, if your input in A was "hello", it would index "hello", "ello", "llo", "lo", and "o" in a way that would allow any of those substrings to be looked up immediately and efficiently, without any additional enumeration of the set A.

Basically, take all the values in A and process them into a Suffix Trie. This is an O(n * m) operation, done once, where n is the number of elements in A and m is the length of the elements. Then, for each element of B, check for it in the Suffix Trie, which is also an O(n * m) operation, where n is now the number of elements in B and m is the length of those elements.
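To make the idea concrete, here is a rough sketch of a naive suffix trie, using the "hello" example from above. The class and method names (`SuffixTrie`, `Add`, `ContainsSubstring`) are made up for illustration. This naive version inserts every suffix character by character, so building it costs on the order of m^2 steps for a string of length m; a compressed suffix tree would be needed to do better.

```csharp
using System;
using System.Collections.Generic;

// Naive suffix trie: every suffix of every indexed string is inserted
// character by character, so any substring of an indexed string is a
// path starting at the root. Names here are illustrative assumptions.
public sealed class SuffixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
    }

    private readonly Node _root = new Node();

    // Inserts all suffixes of text ("hello", "ello", "llo", "lo", "o").
    public void Add(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            Node node = _root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out Node next))
                {
                    next = new Node();
                    node.Children[text[i]] = next;
                }
                node = next;
            }
        }
    }

    // True if query occurs as a substring of any indexed string.
    // The walk takes time proportional to query.Length.
    public bool ContainsSubstring(string query)
    {
        Node node = _root;
        foreach (char c in query)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return true;
    }
}

public static class SuffixTrieDemo
{
    public static void Main()
    {
        var trie = new SuffixTrie();
        trie.Add("hello");

        Console.WriteLine(trie.ContainsSubstring("ello"));  // True
        Console.WriteLine(trie.ContainsSubstring("ll"));    // True
        Console.WriteLine(trie.ContainsSubstring("world")); // False
    }
}
```

Note that a trie built this way answers "does the query occur inside an indexed string". To match the direction in the question, where an element of A has to occur inside an element of B, the indexed side and the query side may need to be swapped, with each node also recording which source strings pass through it.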

Answer 3

I think you may be overthinking it:

List<string> results = new List<string>();
foreach (string test in B)
{
   if (A.Any(a => test.Contains(a)))
      results.Add(test);
}

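BTW, the complexity of this is somewhere between O(m) (best case, when every string in B matches the first element of A) and O(n*m) (worst case), where n is the number of elements in A and m is the number of elements in B, counting each Contains call as one step.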
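To see where the gap between best and worst case comes from, here is a small sketch that runs the same loop and counts the `Contains` calls. The `ComparisonCounter` class and its `Filter` method are hypothetical helpers added for illustration; the sample data is taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ComparisonCounter
{
    // Same loop as above, but also counts how many Contains calls were made.
    // Helper name and tuple return type are illustrative assumptions.
    public static (List<string> Results, int Comparisons) Filter(
        IEnumerable<string> b, IEnumerable<string> a)
    {
        var results = new List<string>();
        int comparisons = 0;

        foreach (string test in b)
        {
            // Any stops at the first match, so a matching string
            // can cost fewer than a.Count() calls.
            if (a.Any(x => { comparisons++; return test.Contains(x); }))
                results.Add(test);
        }

        return (results, comparisons);
    }

    public static void Main()
    {
        var B = new[] { "Hello World!", "Hello Stack Overflow!", "Foo Bar!", "Food is nice...", "Hej" };
        var A = new[] { "World", "Foo" };

        var (results, comparisons) = Filter(B, A);

        Console.WriteLine(results.Count); // 3
        Console.WriteLine(comparisons);   // 9
    }
}
```

With 5 strings in B and 2 in A, the worst case would be 10 calls; the single saved call comes from "Hello World!" matching the first element of A.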
