How to optimize regular expression performance?

I have a very long regular expression. It is an alternation of roughly 5000 or more phrases.

The text I run the regex against is also large, around 5 KB.

Because both the regex and the input text are large, executing the regex takes at least 2 minutes, which is not acceptable in my project.

So I would like to know how I can optimize this. One way I can think of is to split the regex and use multiple threads to reduce the execution time. Is this the right option, or is there a better way?

Part of my regex looks like this:

(ACS|ADDR.com Technologies|ADP private limited|ADP|ADP India private limited|AIT Software Services PTE limited|AMK Technologies private limited|ANMSoft Technologies private limited|ANZ Information Technology private limited|ASD Global India private Limited|ASD India private Limited|ASM Technologies private limited|AXA Group Solutions India private limited|AXA technology India limited|Aarkay Infonet private limited|AbsolutData Research and Analytics private limited|Accenture India private limited|Accenture Services India|Accenture Services P Limited|Accenture Services private Limited|Accenture|Accenture Software Private Limited|Accurum India private limited|AceTechnologies Inc|Aclat Inc|AcmeCeeYess Softech Private Limited|Adaequare India private limited|Adaequare Info private limited|Adea International private limited|Adea Technologies|Adeptra|Aditi Technologies|Adobe Systems|Adroit Business Solutions|Adroit and Claretdene Infotech private limited|Affron Infotech|Agile Software Enterprise private limited|Agilent Technologies International private limited|Akebono Soft Technologies private limited|AkebonoSoft Technologies private limited|Akmin Technologies|Algorhythm Technologies private limited|Allsec Technologies private limited|Alphonso Informex private limited|Altria Client Services|Altruist India private limited|Amdocs|Amdocs Development Center India private limited|Amdocs Development Centre India|American CyberSystems|American Express Service India private limited|American Stock Exchange|Amrok Securities|Anish Information Technology private limited|Ankhnet Informations private limited|Apex Technologies private limited|AppLabs|AppLabs Technologies private limited|Appshark India|Apptix Software private limited|Aquila Technologies|Arcot R and D Software private limited|Arsin Systems private limited|Ascendum Solutions private limited|AskMe Software private limited|Atos Origin private limited|Atos Origin|Atos Origin India private limited|Aurigo Software Technologies private limited|Aurona Technologies private limited|Autopower Software Solutions|Aztecsoft|BMC Software India private limited|Balasai Net private limited|Bayon Solutions private limited|Beachwood Computing Limited|Birlasoft limited|Blue Bird Technologies private limited|Blue Fountain Media private limited|Blue Star InfoTech|Boden Inc|Boston|Braahamam Net Solutions private limited|Braahmam Net Solutions private limited|Brain Soft technology private limited|Brigade Corporation Private Limited|Business Link Automation India private limited|BusinessLink Automation private limited|C Ahead Info Technologies India private limited|CDI Corporation|CCG India private limited|CEM Solutions|CGI Information Systems and Management Consultants private limited|CGI Information Systems private limited|CGI Information System and Management Consultants private limited|CGI Information and Management private limited|CGI Netvorks|CISCO Systems India private limited|CMC Limited|COMSYS Inc|CORE SHELL TECHNOLOGIES|CRC Software India private limited|CRV Executive Search private limited|CS Software Solutions private Limited|CSC India private Limited|CSS Corp private limited|Cambridge Solutions Limited|Cambridge Solutions|Cambridge Solutions Sdn. Bhd|Candor Ind. private limited|Candor India private limited|Canvas Creatives private limited|Canvera|Capgemini Business Service India Limited|Capgemini private)

I am using C# for this.

Please enlighten me!

You can optimize a regex by using atomic grouping or possessive quantifiers where possible.

Also, if you have things like .* or .+ in your regex, which can be real memory and runtime hogs, replace them with (possessive) character classes, again where possible.
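As a small illustration of the character-class advice, here is a sketch using Python's `re` module (chosen only for brevity; the sample text and the `name`/`city` attributes are made up). It compares a greedy `.*`, which runs to the end of the input and then backtracks, with a negated character class that can never run past the closing quote. In .NET, an atomic group is written `(?>...)`.

```python
import re

# Made-up sample input for illustration only.
text = 'name="Accenture" city="Boston"'

# Greedy dot-star first consumes the rest of the line, then backtracks
# until a closing quote fits, so it overshoots the intended value.
greedy = re.compile(r'name="(.*)"')

# A negated character class stops at the first closing quote and
# leaves the engine nothing to backtrack into.
bounded = re.compile(r'name="([^"]*)"')

greedy_value = greedy.search(text).group(1)
bounded_value = bounded.search(text).group(1)

print(greedy_value)
print(bounded_value)
```

The bounded version returns only the intended value and does less work on long inputs, because the engine never has to give characters back.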

For more specific answers, you'll need to post your regex.

Good luck!

You can greatly improve the performance of this regex by prepending \b at the beginning:

\b(ACS| ... |Z)

This stops the engine from trying the whole alternation at every character; it only tries it at positions where a word starts.
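A minimal sketch of the effect of the leading `\b`, written in Python for brevity (the same syntax works in .NET). The four phrases and the sample sentence are hypothetical stand-ins for the real list of about 5000 phrases.

```python
import re

# Hypothetical, shortened phrase list standing in for the real one.
phrases = ["ACS", "ADP", "Accenture", "Adobe Systems"]
alternation = "|".join(re.escape(p) for p in phrases)

unanchored = re.compile("(" + alternation + ")")
anchored = re.compile(r"\b(" + alternation + ")")

text = "She left MACS Ltd for Adobe Systems, then joined ADP."

# Without \b the alternation is tried at every character, so it also
# finds "ACS" inside "MACS".
unanchored_hits = unanchored.findall(text)

# With \b the alternation is only tried where a word starts.
anchored_hits = anchored.findall(text)

print(unanchored_hits)
print(anchored_hits)
```

Besides cutting the number of start positions, the anchor also changes the result: matches that begin in the middle of a word disappear, which is usually what you want for company names.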

One optimization would be to extract common prefixes. Change occurrences like

(This is some text|This is some other text)

to

This is some (text|other text)

This should also be done on every level. Change occurrences like

ABCD|ADCB|BACD|BADC|BCAD|BCDA|BDAC|BDCA|CABD

to

A(BCD|DCB)|B(A(CD|DC)|C(AD|DA)|D(AC|CA))|CABD

This optimization means the regex engine won't have to test the same characters multiple times.
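When factoring by hand it is easy to drop or add an alternative, so it can be worth checking that the two forms accept the same strings. A quick sanity check in Python, testing every permutation of the four letters against both patterns from the example above:

```python
import re
from itertools import permutations

flat = re.compile(r"ABCD|ADCB|BACD|BADC|BCAD|BCDA|BDAC|BDCA|CABD")
factored = re.compile(r"A(BCD|DCB)|B(A(CD|DC)|C(AD|DA)|D(AC|CA))|CABD")

# All 24 orderings of the letters A, B, C, D.
candidates = ["".join(p) for p in permutations("ABCD")]

flat_matches = {s for s in candidates if flat.fullmatch(s)}
factored_matches = {s for s in candidates if factored.fullmatch(s)}

# Both patterns should accept exactly the same nine strings.
print(sorted(flat_matches))
print(flat_matches == factored_matches)
```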

This can be achieved by sorting the phrases and comparing successive elements. Be careful not to split at meta-characters: you don't want to split in the middle of .* or \.

Another way would be to use a trie structure to find the prefixes. This is more robust, but a little more complicated.
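The trie approach could be sketched roughly as follows, in Python. The function names `build_trie` and `trie_to_regex` are made up for this illustration, and the sketch assumes every phrase is a literal string: each character is escaped individually, so a meta-character is never split.

```python
import re

def build_trie(phrases):
    """Nested dicts, one level per character; "" marks the end of a phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def trie_to_regex(node):
    """Turn a trie node into a pattern matching every suffix stored below it."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]  # a single continuation needs no group
    body = "(?:" + "|".join(branches) + ")"
    # If a phrase ends at this node, everything after it is optional.
    return body + "?" if ends_here else body

# Hypothetical subset of the phrase list.
phrases = ["ADP", "ADP India", "Adea", "Adeptra"]
pattern = re.compile(trie_to_regex(build_trie(phrases)))

longest = pattern.search("ADP India private limited").group(0)
print(pattern.pattern)
print(longest)
```

One behavioural difference to be aware of: the generated pattern prefers the longest phrase sharing a prefix (here "ADP India" rather than "ADP"), whereas a flat alternation returns whichever alternative is listed first.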

I know it's old, but still...

"OR" rules (and, for that matter, all the standard rules: concatenation, repetition and alternation) don't require manual optimization. Most regex engines optimize them when compiling the pattern. Sometimes the opposite is true: having too many groups can hurt performance, because the engine has to save each group's match.
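If groups are only needed for alternation and not for extracting sub-matches, non-capturing groups `(?:...)` avoid that bookkeeping. A small sketch in Python (the `(?:...)` syntax is the same in .NET); the two-word phrases are made up for illustration:

```python
import re

text = "Accenture Services and Adobe Systems"

# Three capturing groups: the engine records a span for each of them.
capturing = re.compile(r"\b((Accenture|Adobe) (Services|Systems))\b")

# Same language, but nothing is recorded apart from the overall match.
non_capturing = re.compile(r"\b(?:Accenture|Adobe) (?:Services|Systems)\b")

capturing_hits = [m.group(0) for m in capturing.finditer(text)]
non_capturing_hits = [m.group(0) for m in non_capturing.finditer(text)]

print(capturing.groups, non_capturing.groups)
print(capturing_hits == non_capturing_hits)
```

How much this saves depends on the engine and the pattern, so it is worth measuring on the real phrase list rather than assuming a gain.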

What really hurts performance is lookahead and lookbehind rules, which your query does not use.

In this case the author could add a '\b' rule at the beginning and end of the query to require whole-word matching, which would significantly limit the positions where the engine starts matching.

An example with Python (there is also a C tool for optimizing regular expressions at https://github.com/ksx123/regex-optimization ):

import hachoir_regex
optimized = hachoir_regex.parse("(ACS|ADDR.com Technologies|ADP private limited|ADP|ADP India private limited|AIT Software Services PTE limited|AMK Technologies private limited|ANMSoft Technologies private limited|ANZ Information Technology private limited|ASD Global India private Limited|ASD India private Limited|ASM Technologies private limited|AXA Group Solutions India private limited|AXA technology India limited|Aarkay Infonet private limited|AbsolutData Research and Analytics private limited|Accenture India private limited|Accenture Services India|Accenture Services P Limited|Accenture Services private Limited|Accenture|Accenture Software Private Limited|Accurum India private limited|AceTechnologies Inc|Aclat Inc|AcmeCeeYess Softech Private Limited|Adaequare India private limited|Adaequare Info private limited|Adea International private limited|Adea Technologies|Adeptra|Aditi Technologies|Adobe Systems|Adroit Business Solutions|Adroit and Claretdene Infotech private limited|Affron Infotech|Agile Software Enterprise private limited|Agilent Technologies International private limited|Akebono Soft Technologies private limited|AkebonoSoft Technologies private limited|Akmin Technologies|Algorhythm Technologies private limited|Allsec Technologies private limited|Alphonso Informex private limited|Altria Client Services|Altruist India private limited|Amdocs|Amdocs Development Center India private limited|Amdocs Development Centre India|American CyberSystems|American Express Service India private limited|American Stock Exchange|Amrok Securities|Anish Information Technology private limited|Ankhnet Informations private limited|Apex Technologies private limited|AppLabs|AppLabs Technologies private limited|Appshark India|Apptix Software private limited|Aquila Technologies|Arcot R and D Software private limited|Arsin Systems private limited|Ascendum Solutions private limited|AskMe Software private limited|Atos Origin private limited|Atos Origin|Atos Origin India private limited|Aurigo Software Technologies private limited|Aurona Technologies private limited|Autopower Software Solutions|Aztecsoft|BMC Software India private limited|Balasai Net private limited|Bayon Solutions private limited|Beachwood Computing Limited|Birlasoft limited|Blue Bird Technologies private limited|Blue Fountain Media private limited|Blue Star InfoTech|Boden Inc|Boston|Braahamam Net Solutions private limited|Braahmam Net Solutions private limited|Brain Soft technology private limited|Brigade Corporation Private Limited|Business Link Automation India private limited|BusinessLink Automation private limited|C Ahead Info Technologies India private limited|C.D.I Corporation|CCG India private limited|CEM Solutions|CGI Information Systems and Management Consultants private limited|CGI Information Systems private limited|CGI Information System and Management Consultants private limited|CGI Information and Management private limited|CGI Netvorks|CISCO Systems India private limited|CMC Limited|COMSYS Inc|CORE SHELL TECHNOLOGIES|CRC Software India private limited|CRV Executive Search private limited|CS Software Solutions private Limited|CSC India private Limited|CSS Corp private limited|Cambridge Solutions Limited|Cambridge Solutions|Cambridge Solutions Sdn. Bhd|Candor Ind. private limited|Candor India private limited|Canvas Creatives private limited|Canvera|Capgemini Business Service India Limited|Capgemini private)")
len(str(optimized)) # has length 3048

The original string has length 3399. The longer the pattern gets, the more optimizations are possible. This uses the hachoir-regex library. You could use it in addition to adding \b, as proposed in the other answers.
