简体   繁体   中英

Regex search with pattern containing (?:.|\s)*? takes increasingly long time

My regex is taking increasingly long to match (about 30 seconds the 5th time) but needs to be applied for around 500 rounds of matches. I suspect catastrophic backtracking. Please help! How can I optimize this regex:

String regex = "<tr bgcolor=\"ffffff\">\\s*?<td width=\"20%\"><b>((?:.|\\s)+?): *?</b></td>\\s*?<td width=\"80%\">((?:.|\\s)*?)(?=(?:</td>\\s*?</tr>\\s*?<tr bgcolor=\"ffffff\">)|(?:</td>\\s*?</tr>\\s*?</table>\\s*?<b>Tags</b>))";

EDIT: since it was not clear(my bad): i am trying to take a html formatted document and reformat by extracting the two search groups and adding formating afterwards.

The alternation (?:.|\\\\s)+? is very inefficient, as it involves too much backtracking.

Basically, all variations of this pattern are extremely inefficient: (?:.|\\s)*? , (?:.|\\n)*? , (?:.|\\r\\n)*? and there greedy counterparts, too ( (?:.|\\s)* , (?:.|\\n)* , (?:.|\\r\\n)* ). (.|\\s)*? is probably the worst of them all.

Why?

The two alternatives, . and \\s may match the same text at the same location, the both match regular spaces at least. See this demo taking 3555 steps to complete and .*? demo (with s modifier) taking 1335 steps to complete.

Patterns like (?:.|\\n)*? / (?:.|\\n)* in Java often cause a Stack Overflow issue , and the main problem here is related to the use of alternation (that already alone causes backtracking) that matches char by char, and then the group is modified with a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended to use (only in ElasticSearch Lucene regex engine the (.|\\n) is the only way to match any char).

Solution

If you want to match any characters including whitespace with regex, do it with

[\\s\\S]*?

Or enable singleline mode with (?s) (or Pattern.DOTALL Matcher option) and just use . (eg (?s)start(.*?)end ).

NOTE : To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM