PHP PCRE: allow nested patterns (recursion) in a string

I have a string like 1(8()3(6()7())9()3())2(4())3()1(0()3()) which represents a tree. An opening bracket appears whenever we go one level deeper. Numbers on the same level are neighbours.

Now I want to add nodes. For example, I want to add a 5 to every path that has a 1 on the first level and a 3 on the second level, so I want to put a 5() after every 3( that is inside a 1( . So 5() has to be added 3 times, and the result should be 1(8()3(5()6()7())9()3(5()))2(4())3()1(0()3(5()))

The problem is that I can't get the code working with PCRE recursion. If I match a tree representation string without any fixed paths like 1( and 3( it works, but when I generate a regex with those fixed patterns, it doesn't work.

This is my code:

<?php
header('content-type: text/plain;utf-8');

$node = [1, 3, 5];
$path = '1(8()3(6()7())9()3())2(4())3()1(0()3())';

echo $path.'
';

$nes = '\((((?>[^()]+)|(?R))*)\)';
$nes = '('.$nes.')*';

echo preg_match('/'.$nes.'/x', $path) ? 'matches' : 'matches not';
echo '
';

// creates a regex with the fixed path structure, but allows nested elements in between
// in this example something like: /^anyNestedElementsHere 1( anyNestedElementsHere 3( anyNestedElementsHere ))/
$re = $nes;
for ($i = 0; $i < count($node)-1; $i++) {
    $re .= $node[$i].'\(';
    if ($i != count($node)-2)
        $re .= $nes;
}
$re = '/^('.$re.')/x';

echo str_replace($nes, '   '.$nes.'   ', $re).'
';
echo preg_match($re, $path) ? 'matches' : 'matches not';
echo '
';
// append 5()
echo preg_replace($re, '${1}'.$node[count($node)-1].'()', $path);
?>

And this is the output, where you can see what the generated regex looks like:

1(8()3(6()7())9()3())2(4())3()1(0()3())
matches
/^(   (\((((?>[^()]+)|(?R))*)\))*   1\(   (\((((?>[^()]+)|(?R))*)\))*   3\()/x
matches not
1(8()3(6()7())9()3())2(4())3()1(0()3())

I hope you understand my problem and can tell me where my error is.

Thanks a lot!

Solution

Regex

The following regex matches nested brackets recursively, finding an opening 1( on the first level and an opening 3( on the second level (as a direct child). It also attempts successive matches, either staying on the same level or going back down through the respective levels to find another match.

~
(?(?=\A)  # IF: First match attempt (if at start of string)   - -

  # we are on 1st level => find next "1("

  (?<balanced_brackets>
    # consumes balanced brackets recursively where there is no match
    [^()]*+
    \(  (?&balanced_brackets)*?  \)
  )*?

  # match "1(" => enter level 2
  1\(

|         # ELSE: Successive matches  - - - - - - - - - - - - - -

  \G    # Start at end of last match (level 3)

  # Go down to level 2 - match ")"
  (?&balanced_brackets)*?
  \)

  # or go back to level 1 - matching another ")"
  (?>
    (?&balanced_brackets)*?
    \)

    # and enter level 2 again
    (?&balanced_brackets)*?
    1\(
  )*?
)                                      # - - - - - - - - - - - -

# we are on level 2 => consume balanced brackets and match "3("
(?&balanced_brackets)*?
3\K\(  # also reset the start of the match
~x

Replacement

(5()

Text

Input:
1(8()3(6()7())9()3())2(4())3()1(0()3())

Output:
1(8()3(5()6()7())9()3(5()))2(4())3()1(0()3(5()))
       ^^^            ^^^                  ^^^
       [1]            [2]                  [3]

regex101 demo


How it works

We start by using a conditional subpattern to distinguish between:

  • the first match attempt (from level 1) and
  • the successive attempts (starting at level 3, anchored with the \G assertion).
(?(?=\A)  # IF followed by start of string
    # This is the first attempt
|         # ELSE
    # This is another attempt
    \G    # and we'll anchor it to the end of last match
)
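
As a minimal sketch of how this conditional behaves on its own (the toy pattern and input below are illustrative assumptions, not part of the regex above): the first attempt has to start at the beginning of the string, and every later attempt has to continue exactly where the previous match ended.

```php
<?php
// Toy pattern (assumption for illustration only):
// IF at start of string, match "a"; ELSE continue from the end of the
// previous match (\G) and match "b".
$toy = '~(?(?=\A) a | \G b )~x';

preg_match_all($toy, 'abbxb', $m);

// The chain of \G-anchored matches breaks at "x",
// so the trailing "b" is never reached.
print_r($m[0]); // "a", "b", "b"
```

Because of \G, a failed attempt cannot skip ahead in the string, which is what keeps the real pattern from losing track of the level it is on.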

For the first match, we'll consume all nested brackets that don't match 1( , in order to get the cursor to a position on the first level where it could find a successful match.

  • This is a well-known recursive pattern to match nested constructs. If you're unfamiliar with it, please refer to Recursion and Subroutines.
(?<balanced_brackets>        # ANY NUMBER OF BALANCED BRACKETS
  [^()]*+                    # match any non-bracket characters
  \(                         # opening bracket
    (?&balanced_brackets)*?  #  with nested bracket (recursively)
  \)                         # closing bracket in the main level
)*?                          # Repeated any times (lazy)

Notice that this is a named group, which we will call as a subroutine many times in the pattern to consume unwanted balanced brackets, as (?&balanced_brackets)*? .
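
To see the named group work in isolation, here is a small sketch that wraps it between \A and \z and checks whether a whole string is made of balanced bracket groups. The function name isBalancedGroups is a hypothetical helper introduced only for this illustration.

```php
<?php
// Hypothetical helper (not from the answer): returns true when $s consists
// only of balanced bracket groups such as "8()3(6()7())".
function isBalancedGroups(string $s): bool
{
    $re = '~
        \A
        (?<balanced_brackets>
          [^()]*+                    # anything that is not a bracket
          \(                         # opening bracket
            (?&balanced_brackets)*?  # nested groups (recursive call)
          \)                         # matching closing bracket
        )*?
        \z
    ~x';

    return preg_match($re, $s) === 1;
}

var_dump(isBalancedGroups('8()3(6()7())')); // bool(true)
var_dump(isBalancedGroups('8()3(6()7()'));  // bool(false): one ")" is missing
```

Note that each group has to end with a closing bracket, so a string with trailing digits such as 8()3 would not pass this check.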

Next levels. Now, to enter level 2, we need to match:

1\(

And finally, we'll consume any balanced brackets until we find the opening of the 3rd level:

(?&balanced_brackets)*?
3\(

That's it. We've just matched our first occurrence, so we can insert the replacement text at that position.

Next match. For the successive match attempts, we can either:

  • go down to level 2, matching a closing ) , to find another occurrence of 3(
  • go further down to level 1, matching two closing ) , and from there match using the same strategy as we used for the first match.

This is achieved with the following subpattern:

\G                             # anchored to the end of last match (level 3)
(?&balanced_brackets)*?        # consume any balanced brackets
\)                             # go down to level 2
                               #
(?>                            # And optionally
  (?&balanced_brackets)*?      #   consume level 2 brackets
  \)                           #   to go down to level 1
  (?&balanced_brackets)*?      #   consume level 1 brackets
  1\(                          #   and go up to level 2 again
)*?                            # As many times as it needs to (lazy)

To conclude, we can match the opening of the 3rd level:

(?&balanced_brackets)*?
3\(

We'll also reset the start of the match near the end of the pattern, with \K , so that only the last opening bracket is matched. Thus, we can simply replace with (5() , avoiding the use of backreferences.
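
A small sketch of what \K buys us, using a shortened input chosen for illustration: both calls below produce the same string, but the first one needs no capture group or backreference in the replacement.

```php
<?php
$input = '3(6()7())';

// With \K: the "3" must be present, but it is dropped from the match,
// so only the "(" gets replaced.
$withK = preg_replace('~3\K\(~', '(5()', $input);

// Without \K: capture "3(" and put it back through a backreference.
$withGroup = preg_replace('~(3\()~', '${1}5()', $input);

var_dump($withK);     // string(12) "3(5()6()7())"
var_dump($withGroup); // string(12) "3(5()6()7())"
```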


PHP Code

We only need to call preg_replace() with the same values used above.

Ideone demo
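
Since the demo link does not survive in plain text, here is a sketch of what that call could look like. The regex and the replacement are copied from the sections above; the variable names and the use of a nowdoc string are assumptions made for this example.

```php
<?php
$path = '1(8()3(6()7())9()3())2(4())3()1(0()3())';

// The regex from the "Regex" section, kept in a nowdoc so that no
// backslash or quote needs extra escaping.
$regex = <<<'REGEX'
~
(?(?=\A)  # IF: First match attempt (if at start of string)
  # we are on 1st level => find next "1("
  (?<balanced_brackets>
    # consumes balanced brackets recursively where there is no match
    [^()]*+
    \(  (?&balanced_brackets)*?  \)
  )*?
  # match "1(" => enter level 2
  1\(
|         # ELSE: Successive matches
  \G    # Start at end of last match (level 3)
  # Go down to level 2 - match ")"
  (?&balanced_brackets)*?
  \)
  # or go back to level 1 - matching another ")"
  (?>
    (?&balanced_brackets)*?
    \)
    # and enter level 2 again
    (?&balanced_brackets)*?
    1\(
  )*?
)
# we are on level 2 => consume balanced brackets and match "3("
(?&balanced_brackets)*?
3\K\(  # also reset the start of the match
~x
REGEX;

// Every match is just the "(" that follows a qualifying "3".
$result = preg_replace($regex, '(5()', $path);

echo $result, "\n";
// 1(8()3(5()6()7())9()3(5()))2(4())3()1(0()3(5()))
```

The 3() on the first level is left alone, because the pattern only accepts a 3( that is a direct child of a 1( .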


Why did your regex fail?

Since you asked: the pattern is anchored to the start of the string, so it can only match the first occurrence.

/^(   (\((((?>[^()]+)|(?R))*)\))*   1\(   (\((((?>[^()]+)|(?R))*)\))*   3\()/x

Also, it doesn't match the first occurrence because the construct (?R) recurses the whole pattern (trying to match ^ again). We could change (?R) to (?2) .
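
The effect of recursing into an anchored pattern can be reproduced with a much smaller example. The two patterns below are simplified stand-ins written for this illustration, not the generated regex itself.

```php
<?php
$s = '(a(b))';

// (?R) re-runs the WHOLE pattern, including ^ , which cannot match in the
// middle of the string, so the nested "(b)" is never accepted.
$withR = preg_match('/^\((?:[^()]++|(?R))*\)$/', $s);

// (?1) re-runs only group 1, so the anchors stay outside the recursion.
$withGroup = preg_match('/^(\((?:[^()]++|(?1))*\))$/', $s);

var_dump($withR);     // int(0)
var_dump($withGroup); // int(1)
```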

The main reason, though, is that it does not consume the characters before any opening \( . For example:

Input:
1(8()3(6()7())9()3())2(4())3()1(0()3())
  ^
  #this "8" can't be consumed with the pattern

There's also a behaviour that should be considered: PCRE treats recursion as atomic. So you have to make sure that the pattern consumes unwanted brackets as in the above example, but also avoids matching 1( or 3( on their respective levels.

I'd break this problem down into two smaller parts:

First, extract the 1 nodes, using the following regex:

(?(DEFINE)
  (?<tree>
    (?: \d+ \( (?&tree) \) )*
  )
)
\b 1 \( (?&tree) \)

Demo

Use preg_replace_callback for this. This will match 1(8()3(6()7())9()3()) and 1(0()3()) .
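
To check the extraction step by itself, the same regex can be passed to preg_match_all(). This is only a sketch for inspecting the matches; the variable names are assumptions.

```php
<?php
$path = '1(8()3(6()7())9()3())2(4())3()1(0()3())';

$re = '#
    (?(DEFINE)
      (?<tree>
        (?: \d+ \( (?&tree) \) )*   # any number of child nodes, recursively
      )
    )
    \b 1 \( (?&tree) \)             # a "1" node with its whole subtree
#x';

preg_match_all($re, $path, $m);

// $m[0] holds the full matches: the two top-level "1" nodes.
print_r($m[0]);
// 1(8()3(6()7())9()3())
// 1(0()3())
```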

Next, it's just a matter of replacing 3( with 3(5() and you're done.

Example in PHP:

$path = '1(8()3(6()7())9()3())2(4())3()1(0()3())';

$path = preg_replace_callback('#
    (?(DEFINE)
      (?<tree>
        (?: \d+ \( (?&tree) \) )*
      )
    )
    \b 1 \( (?&tree) \)
#x', function($m) {
    return str_replace('3(', '3(5()', $m[0]);
}, $path);

The result is: 1(8()3(5()6()7())9()3(5()))2(4())3()1(0()3(5()))
