
A speedy way to match a text to multiple regex patterns

I have about 50,000 keywords, each of which is a regex pattern. My application receives some text content and tries to find the keywords that match it.

I currently do this by looping through all the keywords and searching for each one in the content.

Because there is a lot of content to match, I would like to find a better way, if one exists.

Is there any better way to do it?

This is a sample of the code I'm currently using:

        List<string> keywords = getKeywords();
        string textToMatch = getNews();

        List<string> result = new List<string>();

        // Try every keyword pattern against the text and keep the ones that match.
        foreach (var keyword in keywords)
        {
            Match r = Regex.Match(textToMatch, keyword);
            if (r.Success)
                result.Add(keyword);
        }

First of all, you can use RegexOptions.Compiled, which instructs the regular expression engine to compile the regular expression into IL using lightweight code generation. The program will start more slowly, but matching with the compiled expressions is faster.
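To make the RegexOptions.Compiled suggestion concrete, here is a minimal sketch. The class and method names (CompiledKeywordMatcher, Compile, FindMatches) are made up for illustration and are not from the question. The idea is to construct each Regex once and reuse it for every incoming text, rather than calling the static Regex.Match with a pattern string each time.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class CompiledKeywordMatcher
{
    // Build each Regex once, with RegexOptions.Compiled, so it can be reused for every text.
    public static List<Regex> Compile(IEnumerable<string> keywords)
    {
        return keywords
            .Select(k => new Regex(k, RegexOptions.Compiled))
            .ToList();
    }

    // Return the patterns that match somewhere in the text.
    public static List<string> FindMatches(string text, IEnumerable<Regex> compiled)
    {
        var result = new List<string>();
        foreach (Regex regex in compiled)
        {
            if (regex.IsMatch(text))
                result.Add(regex.ToString()); // Regex.ToString() returns the original pattern
        }
        return result;
    }
}
```

Usage would be along the lines of calling Compile(getKeywords()) once at startup and FindMatches(getNews(), compiled) for each text. Compiling 50,000 patterns to IL is likely to cost noticeable startup time and memory, so it is worth measuring against plain (non-compiled) pre-built Regex instances as well.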

The next step would be a good implementation of the producer-consumer design pattern. Unfortunately, I don't know which part of your operations is the slowest, but if you implement this pattern it should be faster (some pseudocode below).

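        // Requires System, System.Collections.Concurrent and System.Threading.Tasks.
        BlockingCollection<string> collection = new BlockingCollection<string>();
        Action productionAction = () =>
        {
            // Produce data, then queue it (ProcessedData is a placeholder).
            collection.Add(ProcessedData);
            // Tell the consumer that nothing more will be added.
            collection.CompleteAdding();
        };
        Action consumerAction = () =>
        {
            // Take items until the producer has called CompleteAdding().
            foreach (var data in collection.GetConsumingEnumerable())
            {
                // Do your work with data here.
            }
        };
        Parallel.Invoke(productionAction, consumerAction);
        // Execution continues here once everything has been processed.
        // You can also replace the Actions with Task.Run to use multithreading.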

You can also try the easiest change, which may improve performance significantly (or not, since I don't know the rest of your code): replace your loop with Parallel.ForEach.
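A rough sketch of what the Parallel.ForEach version of the loop might look like. ParallelKeywordMatcher is a made-up name, and the sketch assumes the order of the results does not matter.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper; names are illustrative only.
public static class ParallelKeywordMatcher
{
    public static List<string> FindMatches(string text, IEnumerable<string> keywords)
    {
        // List<T> is not safe for concurrent writes, so collect into a ConcurrentBag.
        var matched = new ConcurrentBag<string>();

        // Each keyword is tested independently, so the loop body can run on several threads.
        Parallel.ForEach(keywords, keyword =>
        {
            if (Regex.IsMatch(text, keyword))
                matched.Add(keyword);
        });

        // ConcurrentBag does not preserve insertion order; sort afterwards if order matters.
        return matched.ToList();
    }
}
```

The one thing to watch is the result collection: adding to a plain List<string> from inside the parallel loop is not thread-safe, which is why the sketch uses ConcurrentBag<string>.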

If we consider only one text to match against multiple keywords, there is no completely different way to approach it. We can only use Parallel.For, compiled regexes, and so on.

In my case, I receive a large number of text messages to match against the keywords. Let's assume I have 50 texts and 50,000 keywords. Normally I would run the keyword loop for each text. Instead, I first merge all the texts into one big text and run the match against it, which returns the list of matched keywords. Finally, I run the match for each text again, but only against that matched keyword list.
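The two-pass idea described above could be sketched roughly like this. TwoPassMatcher and its Match method are hypothetical names, and joining the texts with a newline is my own assumption, chosen to make it less likely that a pattern matches across the boundary between two texts.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class TwoPassMatcher
{
    // Returns, for each text (by position), the keyword patterns that match it.
    public static List<List<string>> Match(IReadOnlyList<string> texts, IEnumerable<string> keywords)
    {
        // Pass 1: run every keyword once against all texts merged into one string.
        string merged = string.Join("\n", texts);
        List<Regex> candidates = keywords
            .Select(k => new Regex(k))
            .Where(r => r.IsMatch(merged))
            .ToList();

        // Pass 2: test each text only against the keywords that survived pass 1.
        var perText = new List<List<string>>();
        foreach (string text in texts)
        {
            perText.Add(candidates
                .Where(r => r.IsMatch(text))
                .Select(r => r.ToString()) // Regex.ToString() returns the original pattern
                .ToList());
        }
        return perText;
    }
}
```

One caveat: patterns that use anchors such as ^ or $ can behave differently on the merged text than on an individual text (without RegexOptions.Multiline, ^ only matches at the very start of the merged string), so pass 1 may wrongly discard them. False positives from pass 1 are harmless because pass 2 rechecks each text. The saving also depends on most keywords matching none of the texts.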

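Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.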

 