简体   繁体   中英

PHP PCRE allow nested patterns (recursion) in a string

I got a string like 1(8()3(6()7())9()3())2(4())3()1(0()3()) which is representing a tree. A bracket appears, if we go one level deeper. Numbers on the same level are neighbours.

Now want to add nodes, for example I want to add a 5 to every path where we have a 1 on the first and a 3 on the second level, so I want to put a 5() after every 3( which is inside of a 1( . So 5() has to be added 3 times and the result should be 1(8()3(5()6()7())9()3(5()))2(4())3()1(0()3(5()))

The Problem is, that I don't get the code working with the PCRE recursion. If I match a tree representation string without and fixed paths like 1( and 3( it works, but when I generate a regex with those fixed patterns, it doesn't work.

This is my code:

<?php
header('content-type: text/plain;utf-8');

$node = [1, 3, 5];
$path = '1(8()3(6()7())9()3())2(4())3()1(0()3())';

echo $path.'
';

$nes = '\((((?>[^()]+)|(?R))*)\)';
$nes = '('.$nes.')*';

echo preg_match('/'.$nes.'/x', $path) ? 'matches' : 'matches not';
echo '
';

// creates a regex with the fixed path structure, but allows nested elements in between
// in this example something like: /^anyNestedElementsHere 1( anyNestedElementsHere 3( anyNestedElementsHere ))/
$re = $nes;
for ($i = 0; $i < count($node)-1; $i++) {
    $re .= $node[$i].'\(';
    if ($i != count($node)-2)
        $re .= $nes;
}
$re = '/^('.$re.')/x';

echo str_replace($nes, '   '.$nes.'   ', $re).'
';
echo preg_match($re, $path) ? 'matches' : 'matches not';
echo '
';
// append 5()
echo preg_replace($re, '${1}'.$node[count($node)-1].'()', $path);
?>

And this is the output, where you can see how the generated regex looks like:

1(8()3(6()7())9()3())2(4())3()1(0()3())
matches
/^(   (\((((?>[^()]+)|(?R))*)\))*   1\(   (\((((?>[^()]+)|(?R))*)\))*   3\()/x
matches not
1(8()3(6()7())9()3())2(4())3()1(0()3())

I hope you understand my problem and hope you can tell me, where my error is.

Thanks a lot!

Solution

Regex

The following regex matches nested brackets recursively, finding an opening 1( on the first level, and an opening 3( on the second level (as a direct child). It also attempts successive matches, either on the same level or going down the respective levels to find another match.

~
(?(?=\A)  # IF: First match attempt (if at start of string)   - -

  # we are on 1st level => find next "1("

  (?<balanced_brackets>
    # consumes balanced brackets recursively where there is no match
    [^()]*+
    \(  (?&balanced_brackets)*?  \)
  )*?

  # match "1(" => enter level 2
  1\(

|         # ELSE: Successive matches  - - - - - - - - - - - - - -

  \G    # Start at end of last match (level 3)

  # Go down to level 2 - match ")"
  (?&balanced_brackets)*?
  \)

  # or go back to level 1 - matching another ")"
  (?>
    (?&balanced_brackets)*?
    \)

    # and enter level 2 again
    (?&balanced_brackets)*?
    1\(
  )*?
)                                      # - - - - - - - - - - - -

# we are on level 2 => consume balanced brackets and match "3("
(?&balanced_brackets)*?
3\K\(  # also reset the start of the match
~x

Replacement

(5()

Text

Input:
1(8()3(6()7())9()3())2(4())3()1(0()3())

Output:
1(8()3(5()6()7())9()3(5()))2(4())3()1(0()3(5()))
       ^^^            ^^^                  ^^^
       [1]            [2]                  [3]

regex101 demo


How it works

We start by using a conditional subpattern to distinguish between:

  • the first match attempt (from level 1) and
  • the successive attempts (starting at level 3, anchored with the \\G assertion ).
(?(?=\A)  # IF followed by start of string
    # This is the first attempt
|         # ELSE
    # This is another attempt
    \G    # and we'll anchor it to the end of last match
)

For the first match , we'll consume all nested brackets that don't match 1( , in order to get the cursor to a position in the first level where it could find a successful match.

  • This is a well-known recursive pattern to match nested constructs. If you're unfamiliar with it, please refer to Recursion and Subroutines .
(?<balanced_brackets>        # ANY NUMBER OF BALANCED BRACKETS
  [^()]*+                    # match any characters 
  \(                         # opening bracket
    (?&balanced_brackets)*?  #  with nested bracket (recursively)
  \)                         # closing bracket in the main level
)*?                          # Repeated any times (lazy)

Notice this is a named group that we will use as a subroutine call many times in the pattern to consume unwanted balanced brackets, as (?&balanced_brackets)*? .

Next levels . Now, to enter level 2, we need to match:

1\(

And finally, we'll consume any balanced brackets until we find the opening of the 3rd level:

(?&balanced_brackets)*?
3\(

That's it. We've just matched our first occurrence, so we can insert the replacement text in that position.

Next match . For the successive match attempts, we can either:

  • go down to level 2 matching a closing ) to find another occurrence of 3(
  • go further down to level 1 matching 2 closing ) and, from there, match using the same strategy as we used for the first match.

This is achieved with the following subpattern:

\G                             # anchored to the end of last match (level 3)
(?&balanced_brackets)*?        # consume any balanced brackets
\)                             # go down to level 2
                               #
(?>                            # And optionally
  (?&balanced_brackets)*?      #   consume level 2 brackets
  \)                           #   to go down to level 1
  (?&balanced_brackets)*?      #   consume level 1 brackets
  1\(                          #   and go up to level 2 again
)*?                            # As many times as it needs to (lazy)

To conclude , we can match the opening of the 3rd level:

(?&balanced_brackets)*?
3\(

We'll also reset the start of match near the end of the pattern, with \\K , to only match the last opening bracket. Thus, we can simply replace with (5() , avoiding the use of backreferences.


PHP Code

We only need to call preg_replace() with the same values used above.

Ideone demo


Why did your regex fail?

Since you asked, the pattern is anchored to the start of string. It can only match the first occurrence.

/^(   (\((((?>[^()]+)|(?R))*)\))*   1\(   (\((((?>[^()]+)|(?R))*)\))*   3\()/x

Also, it doesn't match the first occurrence because the construct (?R) recurses the the whole pattern (trying to match ^ again). We could change (?R) to (?2) .

The main reason, though, is because it is not consuming the characters before any opening \\( . For example:

Input:
1(8()3(6()7())9()3())2(4())3()1(0()3())
  ^
  #this "8" can't be consumed with the pattern

There's also a behaviour that should be considered: PCRE treats recursion as atomic . So you have to make sure that the pattern will consume unwanted brackets as in the above example, but also avoid matching 1( or 3( in their respective levels.

I'd break down this problem into two smaller parts:

First, extract the 1 nodes, using the following regex:

(?(DEFINE)
  (?<tree>
    (?: \d+ \( (?&tree) \) )*
  )
)
\b 1 \( (?&tree) \)

Demo

Use preg_replace_callback for this. This will match 1(8()3(6()7())9()3()) and 1(0()3()) .

Next, it's just a matter of replacing 3( with 3(5() and you're done.

Example in PHP:

$path = '1(8()3(6()7())9()3())2(4())3()1(0()3())';

$path = preg_replace_callback('#
    (?(DEFINE)
      (?<tree>
        (?: \d+ \( (?&tree) \) )*
      )
    )
    \b 1 \( (?&tree) \)
#x', function($m) {
    return str_replace('3(', '3(5()', $m[0]);
}, $path);

The result is: 1(8()3(5()6()7())9()3(5()))2(4())3()1(0()3(5()))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM