How exactly does this recursive regex work?

This is a follow-up to this question.

Have a look at this pattern:

(o(?1)?o)

It matches any sequence of o with a length of 2^n, with n ≥ 1.
It works; see regex101.com (word boundaries added for better demonstration).
The question is: why?

In the following, a string (match or not) is described simply by its length, given as a number or a term like 2^n.

Broken down (with added whitespace):

( o (?1)? o )
(           ) # Capture group 1
  o       o   # Matches an o each at the start and the end of the group
              # -> the pattern matches from the outside to the inside.
    (?1)?     # Again the regex of group 1, or nothing.
              # -> Again one 'o' at the start and one at the end. Or nothing.

I don't understand why this matches lengths of 2^n rather than 2n, because I would describe the pattern as "an arbitrary number of oo, nested inside each other".

Visualization:

No recursion, 2 is a match:

oo

One recursion, 4 is a match:

o  o
 oo

So far, so easy.

Two recursions. This is obviously wrong, because the pattern does not match 6:

o    o
 o  o
  oo

But why? It seems to fit the pattern.

I conclude that it is not simply the plain pattern that is repeated, because otherwise 6 would have to match.
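To make that expectation concrete, here is a minimal Python sketch of the "plain repetition" reading. It is not a regex engine, and the function names are made up for illustration. It assumes that (?1) simply re-inserts the group and that the engine may backtrack freely into every recursive call. Under that assumption every even length is accepted, including 6, which is exactly what the real pattern does not do.

```python
def ends_backtracking(text, pos=0):
    """Yield every offset at which o(?1)?o can end when started at pos,
    assuming full backtracking into the recursive call (hypothetical model)."""
    if pos < len(text) and text[pos] == "o":
        # Greedy '?': first try the recursive call, in every way it can match.
        for inner_end in ends_backtracking(text, pos + 1):
            if inner_end < len(text) and text[inner_end] == "o":
                yield inner_end + 1
        # Then try skipping the optional call: just the two outer o's.
        if pos + 1 < len(text) and text[pos + 1] == "o":
            yield pos + 2


def full_match_backtracking(n):
    """True if a string of n o's can be matched as a whole in this model."""
    text = "o" * n
    return any(end == n for end in ends_backtracking(text))


print([n for n in range(1, 11) if full_match_backtracking(n)])  # [2, 4, 6, 8, 10]
```

So the 2n intuition is what a fully backtracking engine would give; the observed 2^n behaviour has to come from something that restricts backtracking.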

But according to regular-expressions.info:

(?P<name>[abc])(?1)(?P>name) matches three letters like (?P<name>[abc])[abc][abc] does.

and

([abc])(?1){3} [...] is equivalent to ([abc])[abc]{3}

So it does seem to simply rematch the regex code, without any information about the previous match of the capture group.

Can someone explain, and maybe visualize, why this pattern matches 2^n and nothing else?

Edit:

It was mentioned in the comments:

I doubt that referencing a capture group inside of itself is actually a supported case.

regular-expressions.info does mention the technique:

If you place a call inside the group that it calls, you'll have a recursive capturing group.

You understand recursion correctly. It is the word boundaries that baffle you here. The \b on either side of the pattern requires the regex engine to match the string only if it is neither preceded nor followed by word characters.

See how the recursion goes here:

( o      (?1)?         o )  => oo

(?1) is then replaced with (o(?1)?o):

( o   (?>o(?1)?o)?     o )  => oo or oooo

Then again:

(o (?>o(?>o(?1)?o)?o)?  o) => oo, oooo, oooooo

See the regex demo without word boundaries.

Why add (?>...) in the example above? Each recursion level in PHP (PCRE) recursive regexes is atomic, unlike in Perl: once a level has matched, the engine does not backtrack into it to try a different match when a later part of the pattern fails.

When you add word boundaries, the first o and the last o matched cannot have any other word characters before or after them. So ooo won't match then.

See also Recursive Regular Expressions explained step by step and Word Boundary: \b at rexegg.com.

Why does oooooo not get matched as a whole, but as oooo and oo?

Again, each recursion level is atomic. oooooo is matched like this:

  • (o(?1)?o) matches the first o.
  • (?1)? gets expanded, so the pattern is now (o(?>o(?1)?o)?o), and it matches the second o in the input.
  • This goes on until (o(?>o(?>o(?>o(?>o(?>o(?>o(?1)?o)?o)?o)?o)?o)?o)?o) no longer matches the input. Backtracking happens, and we go back to the 6th level.
  • The whole 6th recursion level also fails, since it cannot match the necessary number of os.
  • This continues until we reach a level that can match the necessary number of os.
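The steps above can be sketched as a small Python model. This is only an illustration of the atomic behaviour described here, not how PCRE is implemented, and match_atomic is a hypothetical name. The one assumption it encodes is that a recursive call hands back a single end offset, and if the closing o cannot be matched after it, the only remaining option is to skip the optional call altogether.

```python
def match_atomic(text, pos=0):
    """Return the end offset that o(?1)?o commits to when started at pos,
    or None if it cannot match, treating each recursive call as atomic."""
    if pos >= len(text) or text[pos] != "o":
        return None  # the opening o is missing
    # Greedy '?': try the recursive call first; it yields one result only.
    inner_end = match_atomic(text, pos + 1)
    if inner_end is not None and inner_end < len(text) and text[inner_end] == "o":
        return inner_end + 1  # closing o found right after the inner match
    # The inner call cannot be re-entered to give characters back,
    # so the only alternative is to match it zero times.
    if pos + 1 < len(text) and text[pos + 1] == "o":
        return pos + 2
    return None


# Lengths for which the whole string is consumed (what \b...\b demands):
print([n for n in range(1, 17) if match_atomic("o" * n) == n])  # [2, 4, 8, 16]
print(match_atomic("o" * 6))  # 4
print(match_atomic("o" * 7))  # 6
```

In this model only the powers of two are consumed completely, which is consistent with the 2^n observation in the question.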

See the regex debugger.

This is more or less a follow-up to Wiktor's answer. Even after removing the word boundaries, I had a hard time figuring out why oooooo (6) gets matched as oooo and oo, while ooooooo (7) gets matched as oooooo.

Here is how it works in detail:

When expanding the recursive pattern, the inner recursions are atomic. With our pattern we can unroll it to

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)

(In the actual pattern this gets unrolled once more, but that doesn't change the explanation.)

And here is how the strings are matched. First oooooo (6):

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)
o   |ooooo                          <- first o gets matched by first atomic group
o   o   |oooo                       <- second o accordingly
o   o   o   |ooo                    <- third o accordingly
o   o   o   o   |oo                 <- fourth o accordingly
o   o   o   o   oo|                 <- fifth/sixth o by the innermost atomic group
                     ^              <- there is no more o to match, so backtracking starts - innermost ag is not matched, cursor positioned after 4th character
o   o   o   o   xx   o   |o         <- fifth o matches, fourth ag is successfully matched (thus no backtracking into it)
o   o   o   o   xx   o   o|         <- sixth o matches, third ag is successfully matched (thus no backtracking into it)
                           ^        <- no more o, backtracking again - third ag can't be backtracked in, so backtracking into second ag (with matching 3rd 0 times)
o   o                      |oo<oo   <- third and fourth o close second and first atomic group -> match returned  (4 os)

And now ooooooo (7):

(?>o(?>o(?>o(?>o(?>oo)?o)?o)?o)?o)    
o   |oooooo                         <- first o gets matched by first atomic group
o   o   |ooooo                      <- second o accordingly
o   o   o   |oooo                   <- third o accordingly
o   o   o   o   |ooo                <- fourth o accordingly
o   o   o   o   oo|o                <- fifth/sixth o by the innermost atomic group
o   o   o   o   oo  o|              <- fourth ag is matched successfully (thus no backtracking into it)
                         ^          <- no more o, so backtracking starts here, no backtracking into fourth ag, try again 3rd
o   o   o                |ooo<o     <- 3rd ag can be closed, as well as second and first -> match returned (6 os)
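Both traces follow the same rule, which can be written as a short recurrence. The sketch below is my own condensation of the traces, with hypothetical function names, and it assumes the input consists only of o characters. A level facing m remaining characters takes the result of the level facing m - 1 and adds two, provided there is still room for the closing o; otherwise it falls back to a bare oo.

```python
def first_match_length(m):
    """Length matched by the atomic pattern at the start of m consecutive o's,
    or None if there is no match (model derived from the traces above)."""
    length = None
    for remaining in range(1, m + 1):
        if length is not None and length + 2 <= remaining:
            length += 2          # inner match plus opening and closing o
        elif remaining >= 2:
            length = 2           # inner call skipped: plain "oo"
        else:
            length = None        # a single o cannot match
    return length


def split_matches(n):
    """Lengths of the successive matches a global search finds in n o's."""
    pieces, pos = [], 0
    while True:
        length = first_match_length(n - pos)
        if length is None:       # at most one o is left, nothing can match
            return pieces
        pieces.append(length)
        pos += length


print(split_matches(6))  # [4, 2]
print(split_matches(7))  # [6]
```

The two printed results correspond to the traces: 6 is split into oooo and oo, while 7 yields a single match of 6 with one o left over.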
