
Regex search with pattern containing (?:.|\s)*? takes increasingly long time

My regex takes increasingly long to match (about 30 seconds by the 5th match), but it needs to be applied for around 500 rounds of matching. I suspect catastrophic backtracking. Please help! How can I optimize this regex:

String regex = "<tr bgcolor=\"ffffff\">\\s*?<td width=\"20%\"><b>((?:.|\\s)+?): *?</b></td>\\s*?<td width=\"80%\">((?:.|\\s)*?)(?=(?:</td>\\s*?</tr>\\s*?<tr bgcolor=\"ffffff\">)|(?:</td>\\s*?</tr>\\s*?</table>\\s*?<b>Tags</b>))";

EDIT: since it was not clear (my bad): I am trying to take an HTML-formatted document and reformat it by extracting the two capture groups and adding formatting afterwards.

The alternation (?:.|\s)+? is very inefficient, as it involves too much backtracking.

Basically, all variations of this pattern are extremely inefficient: (?:.|\s)*?, (?:.|\n)*?, (?:.|\r\n)*? and their greedy counterparts, too ((?:.|\s)*, (?:.|\n)*, (?:.|\r\n)*). (.|\s)*? is probably the worst of them all.

Why?

The two alternatives, . and \s, may match the same text at the same location; they both match regular spaces, at least. A demo of the alternation-based pattern takes 3555 steps to complete, while the equivalent .*? demo (with the s modifier) takes 1335 steps.
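The overlap between the two alternatives can be checked directly. The following is a minimal sketch; the class name OverlapDemo and the helper bothMatch are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

public class OverlapDemo {
    // Hypothetical helper: true when BOTH alternatives of (?:.|\s)
    // can match the given one-character string on their own.
    static boolean bothMatch(String ch) {
        boolean dot = Pattern.matches(".", ch);     // any char except line terminators
        boolean space = Pattern.matches("\\s", ch); // whitespace only
        return dot && space;
    }

    public static void main(String[] args) {
        System.out.println(bothMatch(" "));  // true: the engine has two ways to match a space
        System.out.println(bothMatch("\t")); // true: same for a tab
        System.out.println(bothMatch("\n")); // false: only \s matches a newline
        System.out.println(bothMatch("a"));  // false: only . matches a letter
    }
}
```

Every character for which both alternatives succeed gives the engine an extra path to retry when the rest of the pattern fails, which is where the additional steps come from.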

Patterns like (?:.|\n)*? / (?:.|\n)* in Java often cause a stack overflow. The main problem is that the alternation (which already causes backtracking on its own) matches one character at a time, and the group is then repeated by a quantifier of unbounded length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended (the one exception is the Elasticsearch Lucene regex engine, where (.|\n) is the only way to match any character).

Solution

If you want to match any character, including whitespace and line breaks, with a regex, use (written as "[\\s\\S]*?" in a Java string literal)

[\s\S]*?
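As a rough illustration of the character-class approach, here is a small sketch. The class name AnyCharClassDemo, the method firstCapture, and the start/end delimiters are placeholders chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnyCharClassDemo {
    // [\s\S] is a single character class that matches any character,
    // line breaks included, so no alternation is needed.
    private static final Pattern BETWEEN = Pattern.compile("start([\\s\\S]*?)end");

    // Returns the text between the first "start" and the nearest following "end",
    // or null when there is no match.
    static String firstCapture(String input) {
        Matcher m = BETWEEN.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The lazy quantifier stops at the first "end", even across a line break.
        System.out.println(firstCapture("start a\nb end x end"));
    }
}
```

Because the class is a single token, the engine has only one way to consume each character.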

Or enable singleline mode with (?s) (or the Pattern.DOTALL flag) and just use . (e.g. (?s)start(.*?)end).
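Applied to the regex from the question, the DOTALL variant might look like the sketch below. This is one possible rewrite, not a tested drop-in for the asker's documents: the class name RowExtractor, the method extract, and the sample HTML are invented for illustration, and the lookahead has been factored so that the shared </td>\s*</tr>\s* prefix appears only once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RowExtractor {
    // Same structure as the question's regex, with (?:.|\s) replaced by a plain dot
    // under Pattern.DOTALL and the two lookahead branches sharing their common prefix.
    private static final Pattern ROW = Pattern.compile(
        "<tr bgcolor=\"ffffff\">\\s*<td width=\"20%\"><b>(.+?): *</b></td>\\s*"
        + "<td width=\"80%\">(.*?)"
        + "(?=</td>\\s*</tr>\\s*(?:<tr bgcolor=\"ffffff\">|</table>\\s*<b>Tags</b>))",
        Pattern.DOTALL);

    // Returns one {label, value} pair per matched table row.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2) });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample input shaped like the markup the question's regex expects.
        String html =
            "<table>\n"
            + "<tr bgcolor=\"ffffff\">\n"
            + "<td width=\"20%\"><b>Name: </b></td>\n"
            + "<td width=\"80%\">Alice\nSmith</td>\n"
            + "</tr>\n"
            + "<tr bgcolor=\"ffffff\">\n"
            + "<td width=\"20%\"><b>City:</b></td>\n"
            + "<td width=\"80%\">Paris</td>\n"
            + "</tr>\n"
            + "</table>\n"
            + "<b>Tags</b>\n";
        for (String[] row : extract(html)) {
            System.out.println(row[0] + " => " + row[1]);
        }
    }
}
```

Compiling the pattern once into a static field also avoids recompiling it on each of the roughly 500 rounds mentioned in the question.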

NOTE: To manipulate HTML, use a dedicated parser, like jsoup. See also the SO post discussing Java HTML parsers.
