
About PHP regexp for a recursive pattern

I have this code:

$string="some text {@block}outside{@block}inside{@}outside{@} other text";

function catchPattern($string,$layer){
  preg_match_all(
    "/\{@block\}".
      "(".
        "(".
           "[^()]*|(?R)".
        ")*".
      ")".
    "\{@\}/",$string,$nodes);
  if(count($nodes)>1){
    for($i=0;$i<count($nodes[1]); $i++){
      if(is_string($nodes[1][$i])){
        if(strlen($nodes[1][$i])>0){
          echo "<pre>Layer ".$layer.":   ".$nodes[1][$i]."</pre><br />";
          catchPattern($nodes[1][$i],$layer+1);
        }
      }
    }
  }
}

catchPattern($string,0);

That gives me this output:

Layer 0:   outside{@block}inside{@}outside

Layer 1:   inside

And that all works fine. But if I change the string and the regexp slightly:

$string="some text {@block}outside{@block}inside{@end}outside{@end} other text";

function catchPattern($string,$layer){
  preg_match_all(
    "/\{@block\}".
      "(".
        "(".
           "[^()]*|(?R)".
        ")*".
      ")".
    "\{@end\}/",$string,$nodes);
  if(count($nodes)>1){
    for($i=0;$i<count($nodes[1]); $i++){
      if(is_string($nodes[1][$i])){
        if(strlen($nodes[1][$i])>0){
          echo "<pre>Layer ".$layer.":   ".$nodes[1][$i]."</pre><br />";
          catchPattern($nodes[1][$i],$layer+1);
        }
      }
    }
  }
}

catchPattern($string,0);

I don't get any output. Why? I expected the same output.

The problem is that the backtracking limit is exhausted. You can always raise the backtracking limit (the pcre.backtrack_limit setting in PHP). However, in the cases I have come across, rewriting the regex is the better solution.
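To check whether the backtracking limit is really the cause, one option is to inspect preg_last_error() after the call. The sketch below is only an illustration: matchWithDiagnostics() is a hypothetical helper name, not something from the question, and which branch you land in depends on your PHP version and PCRE settings (the default limit and JIT behavior differ between installations).

```php
<?php
// Hypothetical helper (not from the original post): run a pattern and
// report whether PCRE finished normally or gave up.
function matchWithDiagnostics(string $pattern, string $subject): array
{
    $nodes = [];
    $count = preg_match_all($pattern, $subject, $nodes);
    return [
        'count' => $count,            // int on success, false if matching was aborted
        'error' => preg_last_error(), // PREG_NO_ERROR when nothing went wrong
        'nodes' => $nodes,
    ];
}

// The pattern from the question, written as a single-quoted string.
$pattern = '/\{@block\}(([^()]*|(?R))*)\{@end\}/';
$string  = "some text {@block}outside{@block}inside{@end}outside{@end} other text";

$result = matchWithDiagnostics($pattern, $string);
if ($result['error'] === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit hit (pcre.backtrack_limit = ", ini_get('pcre.backtrack_limit'), ")\n";
    // One possible workaround: ini_set('pcre.backtrack_limit', '10000000');
} elseif ($result['error'] !== PREG_NO_ERROR) {
    echo "PCRE error code: ", $result['error'], "\n";
} else {
    echo "Matched ", $result['count'], " block(s)\n";
}
```

When matching is aborted, preg_match_all() returns false rather than 0, so code that only inspects the match array (as catchPattern() does) sees no matches and prints nothing.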

You can't modify an existing regex arbitrarily and expect it to keep working, especially a recursive one. It looks like you took an existing bracket-matching regex and adapted it. There are a few problems in your regex:

  • [^()]*: There is no reason to exclude () from the text inside the {@block}{@end} portion. But the more severe problem is that it matches {}. The engine will go all the way to the nearest () or the end of the string, fail to match, then backtrack. Because this greedy class sits inside a group that is itself repeated with *, the engine can split the same text in a huge number of ways while backtracking. This is why the backtracking limit is reached.

    This can be fixed by changing this portion to [^{}] to disallow {} inside {@block}{@end}. Nested {@block}{@end} will still be matched, thanks to the recursion.

    Note that this completely disallows {} as text within {@block}{@end}. It may be possible to modify the regex to allow that case, depending on the escaping scheme.

    I also changed the quantifier of [^{}] from * to +, since there is no reason to match an empty string when the quantifier of the whole group ([^{}]+|(?R)) is *.

     /\{@block\}((?:[^{}]+|(?R))*)\{@end\}/
  • After the modification above, the second problem is with invalid input strings, for example a {@block} that is never closed. By default, a quantifier keeps backtracking until a match is found or all possibilities are exhausted. Therefore, you will still reach the backtracking limit in those cases.

    Since what [^{}]+ can match and what the recursive regex can match are mutually exclusive (see footnote 1), the regex is not ambiguous and can be matched without backtracking. We can tell the engine not to backtrack by using a possessive quantifier, which is the normal quantifier with a + added after it.
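As a small illustration of what a possessive quantifier changes, here is a minimal sketch using a deliberately simple pattern instead of the block regex. The variable names are only for this example.

```php
<?php
$subject = 'aaa';

// Greedy: a+ first takes all three letters, then gives one back
// so that the trailing "a" in the pattern can match.
$greedy = preg_match('/a+a/', $subject);

// Possessive: a++ takes all three letters and never gives any back,
// so the trailing "a" has nothing left to match.
$possessive = preg_match('/a++a/', $subject);

var_dump($greedy);     // int(1)
var_dump($possessive); // int(0)
```

In the block regex the possessive form does not change which strings match, because the two alternatives cannot match the same text. It only stops the engine from retrying splits that cannot succeed.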

The final solution is:

/\{@block\}((?:[^{}]++|(?R))*+)\{@end\}/
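For reference, this is roughly how the final pattern could be dropped into the recursive function from the question. It is a sketch rather than the asker's exact code: collectBlocks() is a hypothetical rewrite of catchPattern() that returns [layer, text] pairs instead of echoing HTML, which makes the result easier to inspect.

```php
<?php
// Hypothetical rewrite of catchPattern() that returns [layer, text] pairs.
function collectBlocks(string $text, int $layer = 0): array
{
    // Final pattern from the answer; single quotes keep the backslashes literal.
    $pattern = '/\{@block\}((?:[^{}]++|(?R))*+)\{@end\}/';
    $found = [];
    if (preg_match_all($pattern, $text, $nodes) > 0) {
        foreach ($nodes[1] as $body) {
            if ($body === '') {
                continue; // skip empty blocks, as the original code does
            }
            $found[] = [$layer, $body];
            // Look for nested blocks inside the captured text.
            foreach (collectBlocks($body, $layer + 1) as $inner) {
                $found[] = $inner;
            }
        }
    }
    return $found;
}

$string = "some text {@block}outside{@block}inside{@end}outside{@end} other text";
foreach (collectBlocks($string) as [$layer, $body]) {
    echo "Layer $layer:   $body\n";
}
// Layer 0:   outside{@block}inside{@end}outside
// Layer 1:   inside
```

With an unbalanced input such as "{@block}never closed", this version simply finds no match.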


Footnotes

1: This is easy to see: text matching [^{}]+ will never start with {, while text matching the recursive regex must start with {.
