Occurrences of a List<string> in a string C#

Given

var stringList = new List<string>(new string[] {
                   "outage","restoration","efficiency"});

var queryText = "While walking through the park one day, I noticed an outage " +
              "in the lightbulb at the plant. I talked to an officer about " +
              "restoration protocol for public works, and he said to contact " +
              "the department of public works, but not to expect much because " +
              "they have low efficiency.";

How do I get the total number of occurrences of all strings in stringList within queryText?

In the above example, I would want a method that returns 3:

private int stringMatches (string textToQuery, string[] stringsToFind)
{
    //
}

RESULTS

SPOKE TOO SOON!

I ran a couple of performance tests, and this version of the code from Fabian was faster by a good margin:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
    int count = 0;
    foreach (var stringToFind in stringsToFind)
    {
        int currentIndex = 0;

        while ((currentIndex = textToQuery.IndexOf(stringToFind, currentIndex, StringComparison.Ordinal)) != -1)
        {
            currentIndex++;
            count++;
        }
    }
    return count;
}

Execution time for a 10,000-iteration loop, measured with Stopwatch:

Fabian: 37-42 milliseconds

lazyberezovsky StringCompare: 400-500 milliseconds

lazyberezovsky Regex: 630-680 milliseconds

Glenn: 750-800 milliseconds

(I added StringComparison.Ordinal to Fabian's answer for additional speed.)
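
For anyone who wants to reproduce timings like these, a harness along the following lines could be used. This is only a sketch: the class name MatchBenchmark and the helpers CountWithIndexOf and TimeMilliseconds are made up for illustration, and the guard against empty search strings is an addition that is not in the code above. Actual numbers will depend on the machine and runtime.

```csharp
using System;
using System.Diagnostics;

public static class MatchBenchmark
{
    // The IndexOf-based counter from above, with one added guard (an assumption,
    // not part of the original): null or empty search strings are skipped.
    public static int CountWithIndexOf(string textToQuery, string[] stringsToFind)
    {
        int count = 0;
        foreach (var stringToFind in stringsToFind)
        {
            if (string.IsNullOrEmpty(stringToFind)) continue;

            int currentIndex = 0;
            while ((currentIndex = textToQuery.IndexOf(stringToFind, currentIndex, StringComparison.Ordinal)) != -1)
            {
                currentIndex++;
                count++;
            }
        }
        return count;
    }

    // Runs the given counter `iterations` times and returns the elapsed milliseconds.
    public static long TimeMilliseconds(Func<int> counter, int iterations)
    {
        counter(); // warm-up call so JIT compilation is not part of the measurement
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            counter();
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }
}
```

It could be called as MatchBenchmark.TimeMilliseconds(() => MatchBenchmark.CountWithIndexOf(queryText, stringList.ToArray()), 10000), swapping in each candidate method in turn.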

That might also be fast:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
  int count = 0;
  foreach (var stringToFind in stringsToFind)
  {
    int currentIndex = 0;

    while ((currentIndex = textToQuery.IndexOf(stringToFind, currentIndex, StringComparison.Ordinal)) != -1)
    {
      currentIndex++;
      count++;
    }
  }
  return count;
}

This LINQ query splits the text on spaces and punctuation symbols and counts the words that match, ignoring case:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
   StringComparer comparer = StringComparer.CurrentCultureIgnoreCase;
   return textToQuery.Split(new []{' ', '.', ',', '!', '?'}) // add more separators if needed
                     .Count(w => stringsToFind.Contains(w, comparer));
}

Or with regular expression:

private static int stringMatches(string textToQuery, string[] stringsToFind)
{
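    // Regex.Escape keeps search strings containing characters such as '.' or '+' literal
    var pattern = String.Join("|", stringsToFind.Select(s => @"\b" + Regex.Escape(s) + @"\b"));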
    return Regex.Matches(textToQuery, pattern, RegexOptions.IgnoreCase).Count;
}

If you want to count the distinct words in the string that also appear in the other collection:

private int stringMatches(string textToQuery, string[] stringsToFind)
{
    return textToQuery.Split().Intersect(stringsToFind).Count();
}
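
Because Intersect is a set operation, a word that appears several times in the text is only counted once. The sketch below contrasts that with counting every matching word; the class name WordCounting and both method names are invented for this illustration.

```csharp
using System;
using System.Linq;

public static class WordCounting
{
    // Counts each matching word once, however often it appears in the text.
    public static int CountDistinctWords(string textToQuery, string[] stringsToFind)
    {
        return textToQuery.Split().Intersect(stringsToFind).Count();
    }

    // Counts every matching word, including repeats.
    public static int CountAllWords(string textToQuery, string[] stringsToFind)
    {
        return textToQuery.Split().Count(word => stringsToFind.Contains(word));
    }
}
```

For the text "outage outage restoration" and the three search strings from the question, CountDistinctWords gives 2 and CountAllWords gives 3. Note that Split() with no arguments only splits on whitespace, so a word followed directly by punctuation (such as "efficiency.") will not match in either version.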

I like Tim's answer, but I try to avoid making too many strings to avoid performance issues, and I do like regular expressions, so here's another way to go:

private int StringMatches(string searchMe, string[] keys)
{
    var expression = new System.Text.RegularExpressions.Regex(
        string.Join("|", keys),
        System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    return expression.Matches(searchMe).Count;
}

This is a revision of Fabian Bigler's original answer. It is about 33% faster, mostly because of StringComparison.Ordinal.

Here's a link for more info on this: http://msdn.microsoft.com/en-us/library/bb385972.aspx

private int stringMatches(string textToQuery, List<string> stringsToFind)
{
    int count = 0, stringCount = stringsToFind.Count, currentIndex;
    string stringToFind;
    for (int i = 0; i < stringCount; i++)
    {
        currentIndex = 0;
        stringToFind = stringsToFind[i];
        while ((currentIndex = textToQuery.IndexOf(stringToFind, currentIndex, StringComparison.Ordinal)) != -1)
        {
            currentIndex++;
            count++;
        }
    }
    return count;
}
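
One thing to keep in mind is that StringComparison.Ordinal is case-sensitive, whereas the LINQ and regex answers ignore case. A possible way to keep the IndexOf loop and still choose the behaviour is to pass the comparison in. This is a sketch, not part of the original answer: the class ComparisonExample, the method CountMatches and its comparison parameter are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class ComparisonExample
{
    // Same IndexOf loop as above, but the caller decides how strings are compared.
    public static int CountMatches(string textToQuery, IEnumerable<string> stringsToFind, StringComparison comparison)
    {
        int count = 0;
        foreach (var stringToFind in stringsToFind)
        {
            if (string.IsNullOrEmpty(stringToFind)) continue; // added guard, not in the original

            int currentIndex = 0;
            while ((currentIndex = textToQuery.IndexOf(stringToFind, currentIndex, comparison)) != -1)
            {
                currentIndex++;
                count++;
            }
        }
        return count;
    }
}
```

With the text "Outage after outage" and the single search string "outage", StringComparison.Ordinal counts 1 match and StringComparison.OrdinalIgnoreCase counts 2.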

This version matches only whole words in textToQuery:

The idea is to check that the characters immediately before and after the match are not letters. I also had to handle the cases where the match is at the very start or end of the string.

private int stringMatchesWordsOnly(string textToQuery, string[] wordsToFind)
{
    int count = 0;
    foreach (var wordToFind in wordsToFind)
    {
        int currentIndex = 0;
        while ((currentIndex = textToQuery.IndexOf(wordToFind, currentIndex, StringComparison.Ordinal)) != -1)
        {
            int endIndex = currentIndex + wordToFind.Length;
            if (((currentIndex == 0) || // is the match at the start of the text?
                 !Char.IsLetter(textToQuery, currentIndex - 1)) &&
                ((endIndex == textToQuery.Length) || // does the match reach the end of the text?
                 !Char.IsLetter(textToQuery, endIndex)))
            {
                count++;
            }
            currentIndex++;
        }
    }
    return count;
}

Conclusion: As you can see, this approach is a bit messier than my other answer and is slower (though still faster than the other answers). So it really depends on what you want to achieve. If the strings to find include short words, you should probably use this version, because a word such as 'and' would obviously produce too many matches with the first approach.
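
To make the 'and' case concrete, here is a small self-contained comparison of the two approaches for a single search word. It is an illustrative sketch: the class AndExample and the methods CountSubstrings and CountWholeWords are names made up here, and the empty-string guard is an addition.

```csharp
using System;

public static class AndExample
{
    // Counts every occurrence of `word`, including those inside longer words.
    public static int CountSubstrings(string text, string word)
    {
        if (string.IsNullOrEmpty(word)) return 0;

        int count = 0, index = 0;
        while ((index = text.IndexOf(word, index, StringComparison.Ordinal)) != -1)
        {
            count++;
            index++;
        }
        return count;
    }

    // Counts only occurrences that have no letter directly before or after them.
    public static int CountWholeWords(string text, string word)
    {
        if (string.IsNullOrEmpty(word)) return 0;

        int count = 0, index = 0;
        while ((index = text.IndexOf(word, index, StringComparison.Ordinal)) != -1)
        {
            int end = index + word.Length;
            bool startsCleanly = index == 0 || !char.IsLetter(text[index - 1]);
            bool endsCleanly = end == text.Length || !char.IsLetter(text[end]);
            if (startsCleanly && endsCleanly) count++;
            index++;
        }
        return count;
    }
}
```

For the text "I stand on the sand and wait." and the word "and", CountSubstrings returns 3 (it also finds "and" inside "stand" and "sand"), while CountWholeWords returns 1.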

private int stringMatches(string textToQuery, string[] stringsToFind)
{
    string[] splitArray = textToQuery.Split(new char[] { ' ', ',', '.' });
    return splitArray.Count(p => stringsToFind.Contains(p));
}

