
Iterating through a foreach loop takes too long to process (C#)

I'm trying to iterate over the items in a list and find matching keywords (~100k) using a regex. Can someone please suggest a good way to address the performance problem of looping over such a huge list of items?

List<string> words = new List<string> { "a", "b" /* ... ~100k items */ };

string pattern = @"\b(" + String.Join("|", words) + @")\b";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.Compiled);
MatchCollection mc = r.Matches(TextBox1.Text);

foreach (Match m in mc) {
  Label1.Text = r.Replace(TextBox1.Text, @"<b>$1</b>");
}

Thanks in advance for your help!

Your foreach is totally unnecessary, as is the MatchCollection: notice that you never use the variable m inside the loop, so the same Replace call is simply repeated once per match. You can simplify your code to:

List<string> words = new List<string> { "a", "b" /* ... ~100k items */ };

string pattern = @"\b(" + String.Join("|", words) + @")\b";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.Compiled);

Label1.Text = r.Replace(TextBox1.Text, @"<b>$1</b>");

One thing you may want to tweak: if the words in your list contain special characters that the regex engine may interpret as regex syntax, you can escape them with Regex.Escape:

List<string> words = new List<string> { "a", "b" /* ... ~100k items */ };

// You need "using System.Linq;" to use "words.Select"
string pattern = @"\b(" +
                 String.Join("|", words.Select(x => Regex.Escape(x))) +
                 @")\b";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.Compiled);

Label1.Text = r.Replace(TextBox1.Text, @"<b>$1</b>");
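To see why the escaping matters, here is a minimal sketch. The EscapeDemo class and its Build method are hypothetical names made up for this illustration. It builds the same \b(...)\b pattern with and without Regex.Escape, so the two can be compared on a word containing a dot:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original code.
public static class EscapeDemo
{
    // Builds the \b(word1|word2|...)\b pattern, optionally escaping each word.
    public static Regex Build(string[] words, bool escape)
    {
        var parts = new string[words.Length];
        for (int i = 0; i < words.Length; i++)
        {
            // Regex.Escape turns metacharacters such as '.' into literals ("a.b" -> "a\.b").
            parts[i] = escape ? Regex.Escape(words[i]) : words[i];
        }
        return new Regex(@"\b(" + string.Join("|", parts) + @")\b", RegexOptions.IgnoreCase);
    }
}
```

With the single word "a.b", the unescaped pattern also matches "axb" because the dot matches any character. The escaped pattern matches only the literal text "a.b".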

If performance is a problem, I would suggest this alternate approach:

  1. Place the 100k words into a HashSet with a case-insensitive comparer. A hash set lookup is O(1) on average, i.e., constant time.
  2. Use a regular expression to locate each word in the text, and add the necessary formatting wherever the word is in the HashSet.

The below code shows initialization:

List<string> words = new List<string>();
// add words to list (omitted)

// add words in list to a new hashset with a case-insensitive comparer
HashSet<string> wordsset
    = new HashSet<string>(words, StringComparer.InvariantCultureIgnoreCase);

Then you can process each word in the input text to identify words that are keywords, and format accordingly. The function will return the string with identified words formatted to bold (in hypertext).

string FormatWithSearchTerms(string input, HashSet<string> keywords)
{
        Regex r = new Regex(@"\b\w+\b"); // find individual words.
                                         // (Note: refinement may be needed for 
                                         // special cases, like words with 
                                         // embedded punctuation.)

        return r.Replace(input, (m) =>
        {
            string v = m.Value;
            if (keywords.Contains(v))
                return m.Result("<b>$0</b>");
            else return v;
        });
}

Running the code against a paragraph of text takes about a millisecond, with a wordlist of 109k English words.

Are you checking if specific words in a string of text match words in your 100K list?

If so, I would change the approach.

  • Step 1: Create a trie and use that to store all of your 100K words. A trie is basically a tree of nodes, where each node represents a letter and holds a collection of child nodes (for the next letter in a word). You can search online or check Wikipedia for more info on the trie data structure. For a good but less efficient solution, use a HashSet of strings instead.

  • Step 2: Pull out individual words from your string, and check if they exist in your trie/hashset. Depending on what the format of your string is, you can either split on white space, or use a simple regex using word boundaries (\\b).

Creating the trie/hashset will take a tiny bit of time, but would only need to be done once for the duration of the program. Afterwards, all searches would be extremely quick.

You can be certain, though, that using a regex with so many alternatives is going to be a slow procedure.
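The trie described in Step 1 could look roughly like the following. This is a minimal sketch under stated assumptions, not a tuned implementation: the WordTrie class name and its Add/Contains methods are invented for this example, each node keeps its children in a Dictionary, and case is folded with char.ToUpperInvariant so that lookups are case-insensitive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration; the name and API are assumptions.
public sealed class WordTrie
{
    private sealed class Node
    {
        // One child per possible next letter.
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        // True if a stored word ends exactly at this node.
        public bool IsWord;
    }

    private readonly Node root = new Node();

    // Adds a word, folding case so that lookups are case-insensitive.
    public void Add(string word)
    {
        Node node = root;
        foreach (char raw in word)
        {
            char c = char.ToUpperInvariant(raw);
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Returns true only if the whole word was added; a mere prefix does not count.
    public bool Contains(string word)
    {
        Node node = root;
        foreach (char raw in word)
        {
            if (!node.Children.TryGetValue(char.ToUpperInvariant(raw), out node))
                return false;
        }
        return node.IsWord;
    }
}
```

You would build it once by calling Add for each of the 100K words, then call Contains for each word pulled out of the text in Step 2.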

For example, using a HashSet and splitting on whitespace:

// Build the lookup set once from the keyword list.
HashSet<string> allWords = new HashSet<string>();
for(int i = 0; i < words.Count; i++) {
    allWords.Add(words[i]);
}

// A null separator array splits on any whitespace.
string[] wordsInText = TextBox1.Text.Split(null as string[], StringSplitOptions.RemoveEmptyEntries);
for(int i = 0; i < wordsInText.Length; i++) {
    if(allWords.Contains(wordsInText[i])) {
        // Shows only the first keyword found, then stops.
        Label1.Text = @"<b>" + wordsInText[i] + @"</b>";
        break;
    }
}

Using string.Replace creates a whole new string on every call, which is costly and should be avoided in loops like this. StringBuilder is the way to go in this case.
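As a rough sketch of the StringBuilder idea: the KeywordHighlighter class and its Highlight method are hypothetical names for this example. It assumes that splitting on whitespace is good enough, so punctuation attached to a word will prevent a match and runs of whitespace collapse to single spaces. The output is assembled in one pass instead of by repeated string replacement:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only.
public static class KeywordHighlighter
{
    // Wraps every whitespace-separated token found in keywords in <b>...</b>
    // and joins the tokens back together with single spaces.
    public static string Highlight(string input, HashSet<string> keywords)
    {
        // A null separator array splits on any whitespace.
        string[] tokens = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var sb = new StringBuilder(input.Length + 16);
        for (int i = 0; i < tokens.Length; i++)
        {
            if (i > 0) sb.Append(' ');
            if (keywords.Contains(tokens[i]))
                sb.Append("<b>").Append(tokens[i]).Append("</b>");
            else
                sb.Append(tokens[i]);
        }
        return sb.ToString();
    }
}
```

Passing a HashSet created with a case-insensitive comparer (for example StringComparer.OrdinalIgnoreCase) keeps the matching case-insensitive while preserving the original casing in the output.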
