
A fast way to match a text against multiple regex patterns

I have about 50000 keywords, each of which is applied as a regex pattern. My application receives some text content and tries to find the keywords that match this content.

I'm doing this by looping through all keywords and searching for each of them in the content.

Because there is a lot of content to match, I'd like to find a better way if one exists.

Is there any better way to do it?

This is a sample of the code I'm currently using:

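        // requires: using System.Collections.Generic; using System.Text.RegularExpressions;
        List<string> keywords = getKeywords();
        string textToMatch = getNews();

        List<string> result = new List<string>();

        foreach (var keyword in keywords)
        {
            // each keyword is used as a regex pattern
            Match r = Regex.Match(textToMatch, keyword);
            if (r.Success)
                result.Add(keyword);
        }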

First of all, you can use RegexOptions.Compiled, which instructs the regular expression engine to compile the expression into IL using lightweight code generation. The program will start more slowly, but matching with the compiled regular expressions is faster.
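As a rough sketch of how that could look: the compilation cost is paid when each Regex object is constructed, so the objects should be built once and reused for every incoming text. The static Regex.Match(input, pattern) call in the question relies on an internal cache whose default size (Regex.CacheSize) is 15, so with 50000 patterns most of them would be parsed again on every call. The class and method names below are made up for illustration, and compiling 50000 patterns can itself take noticeable time and memory, so it is worth measuring before committing to it.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question.
public static class CompiledKeywordMatcher
{
    // Build each Regex once, up front. RegexOptions.Compiled pays its cost at
    // construction time, so the instances must be reused to get any benefit.
    public static List<Regex> Compile(IEnumerable<string> keywords)
    {
        return keywords
            .Select(k => new Regex(k, RegexOptions.Compiled))
            .ToList();
    }

    // Return the pattern text of every regex that matches the input.
    public static List<string> FindMatches(IReadOnlyList<Regex> compiled, string textToMatch)
    {
        var result = new List<string>();
        foreach (Regex regex in compiled)
        {
            // IsMatch is enough here because only success or failure matters.
            if (regex.IsMatch(textToMatch))
                result.Add(regex.ToString()); // ToString() returns the original pattern
        }
        return result;
    }
}
```

Compile would be called once at startup (or whenever the keyword list changes), and FindMatches once per incoming text.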

The next step would be a good implementation of the producer-consumer design pattern. Unfortunately I don't know which of your operations is the slowest, but if you implement this pattern it should be faster (some pseudocode below).

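        // requires: using System.Collections.Concurrent; using System.Threading.Tasks;
        BlockingCollection<string> collection = new BlockingCollection<string>();
        Action productionAction = () =>
        {
            // produce data, then queue it
            // (ProcessedData is a placeholder for whatever you produce)
            collection.Add(ProcessedData);
            // tell the consumer that no more items will arrive
            collection.CompleteAdding();
        };
        Action consumerAction = () =>
        {
            // consume items until the producer calls CompleteAdding
            foreach (string data in collection.GetConsumingEnumerable())
            {
                // do your things with data
            }
        };
        Parallel.Invoke(productionAction, consumerAction);
        // execution reaches this point once everything has been processed
        // you can also replace the Actions with Task.Run to manage the threads yourself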

You can also try the easiest change, which may improve performance significantly (or not; I don't know the rest of your code): replace your loop with Parallel.ForEach.
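A minimal sketch of that replacement, assuming the keywords are independent of each other: the class and method names are invented for illustration. Note that List<T> is not safe for concurrent writes, so the results are collected in a ConcurrentBag<string> instead of the List<string> used in the question. Whether this is actually faster depends on the number of cores and on how expensive each pattern is.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original question.
public static class ParallelKeywordMatcher
{
    public static List<string> FindMatches(IEnumerable<string> keywords, string textToMatch)
    {
        // Thread-safe collection for results written from several threads.
        var matched = new ConcurrentBag<string>();

        // Each keyword is tested on whichever worker thread picks it up.
        Parallel.ForEach(keywords, keyword =>
        {
            if (Regex.IsMatch(textToMatch, keyword))
                matched.Add(keyword);
        });

        // Parallel execution does not preserve input order; sort for a stable result.
        return matched.OrderBy(k => k, StringComparer.Ordinal).ToList();
    }
}
```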

If we consider only a single text matched against multiple keywords, there is no completely different way to approach it. We can only use Parallel.For, compiled regexes, and so on.

In my case, I receive many text messages to match against the keywords. Let's assume I have 50 texts and 50000 keywords. Normally I would loop over all keywords for each text. Instead, I now first merge all the texts into one big text and run the match against it, which returns the list of matched keywords. Finally, I run the match for each text again, but only against that list of matched keywords.
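The two passes described above might be sketched as follows. The names are invented for illustration, and joining the texts with a newline is an assumption, not something stated in the answer. One caveat: a pattern that uses anchors such as ^ or $ can match an individual text without matching the merged one, so this shortcut is only safe for patterns that behave the same in both. A pattern that happens to match across the separator is harmless, because the second pass filters it out again.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class PrefilterMatcher
{
    // Returns, for each text (by index), the keywords that match it.
    public static List<List<string>> MatchAll(
        IReadOnlyList<string> texts, IReadOnlyList<string> keywords)
    {
        // Pass 1: test every keyword once against the merged text.
        // The "\n" separator is an assumption made for this sketch.
        string merged = string.Join("\n", texts);
        List<Regex> candidates = keywords
            .Select(k => new Regex(k))
            .Where(r => r.IsMatch(merged))
            .ToList();

        // Pass 2: test each text against the surviving keywords only.
        return texts
            .Select(t => candidates
                .Where(r => r.IsMatch(t))
                .Select(r => r.ToString()) // ToString() returns the original pattern
                .ToList())
            .ToList();
    }
}
```

The saving comes from the first pass: with 50 texts and 50000 keywords it performs 50000 scans instead of 2500000, and the second pass only pays for the keywords that survived.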
