
Regex Recursion: Nth Subpatterns

I'm trying to learn about Recursion in Regular Expressions, and have a basic understanding of the concepts in the PCRE flavour. I want to break a string:

Geese (Flock) Dogs (Pack) 

into:

Full Match: Geese (Flock) Dogs (Pack) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)

I know neither regex quite does this, but I am more curious about why the first pattern works but the second one doesn't.

Pattern 1: ((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*
Pattern 2: ((.*?)(\(\w{1,}\)))((\g<2>)(\g<3>))*

Also, if you're dealing with a long string in which a pattern repeats itself, is it possible to keep expanding the full match and incrementally add groups, without writing a loop separate from the regex?

Full Match: Geese (Flock) Dogs (Pack) Elephants (Herd) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)
Group 7: Elephants (Herd)
Group 8: Elephants 
Group 9: (Herd)

The closest I've come is this pattern, but the middle group, Dogs (Pack), becomes Group 0.

((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*

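Keep in mind that recursion levels in PCRE are atomic: once a recursed pattern has found a match, it is never retried. (This describes the original PCRE library; PCRE2 10.30 and later no longer treat subroutine calls as atomic.)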

See Recursion and Subroutine Calls May or May Not Be Atomic:

Perl and Ruby backtrack into recursion if the remainder of the regex after the recursion fails. They try all permutations of the recursion as needed to allow the remainder of the regex to match. PCRE treats recursion as atomic. PCRE backtracks normally during the recursion, but once the recursion has matched, it does not try any further permutations of the recursion, even when the remainder of the regex fails to match. The result is that Perl and Ruby may find regex matches that PCRE cannot find, or that Perl and Ruby may find different regex matches.

Your second pattern, at the first recursion level, will look like

((.*?)(\(\w{1,}\)))(((?>.*?))((?>\(\w{1,}\))))*
                     ^^^^^^^  ^^^^^^^^^^^^^^

See demo. That is, \g<2> is then (?>.*?), not .*?. This means that, after the ((.*?)(\(\w{1,}\))) part has matched Geese (Flock), the regex engine tries (?>.*?), sees that it is a lazy pattern that does not have to consume any characters, matches the empty string (and will never come back to this group), and then tries (?>\(\w{1,}\)). As the character right after ) is a space rather than (, that attempt fails, the repeated group matches zero times, and the regex returns what it has consumed so far: Geese (Flock).
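The difference can be reproduced outside PCRE. The following is a minimal sketch using Python's standard re module, which has no \g<n> subroutine calls, so both calls are written out by hand. The atomic groups are emulated with the (?=(?P<name>X))(?P=name) idiom, which behaves like (?>X) because a lookahead is never re-entered on backtracking (Python 3.11 and later also accept (?>X) directly). The group names a and b are only illustrative.

```python
import re

text = "Geese (Flock) Dogs (Pack) "

# Pattern 1 with \g<3> written out by hand. The lazy .*? is an ordinary
# part of the pattern, so the engine can grow it until "(Pack)" is reachable.
pattern1 = r"((.*?)(\(\w{1,}\)))((.*?)(\(\w{1,}\)))*"

# Pattern 2 with both subroutine calls written out as emulated atomic groups.
# (?=(?P<a>X))(?P=a) acts like (?>X): the lookahead settles on one match
# for X and is not retried when the rest of the pattern fails.
pattern2 = (
    r"((.*?)(\(\w{1,}\)))"
    r"((?=(?P<a>.*?))(?P=a)(?=(?P<b>\(\w{1,}\)))(?P=b))*"
)

m1 = re.match(pattern1, text)
m2 = re.match(pattern2, text)

print(repr(m1.group(0)))  # 'Geese (Flock) Dogs (Pack)'
print(repr(m2.group(0)))  # 'Geese (Flock)'
```

In the second pattern the emulated (?>.*?) settles on the empty string, the following group then needs a ( where the text has a space, and the repeated group is abandoned after zero iterations.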

As for the second question, it is a common problem. It is not possible to get an arbitrary number of captures with a PCRE regex: when a capturing group is repeated, only the last captured value is stored in the group buffer. You cannot have more submatches in the resulting array than there are capturing groups in the regex pattern. See Repeating a Capturing Group vs. Capturing a Repeated Group for more details.
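The same limitation shows up in Python's standard re module, so it can serve as a rough illustration (this is not PCRE itself, and the simplified \w+ sub-patterns are an assumption made to keep the sketch short). A repeated group keeps only its last capture, while letting the engine iterate over successive matches yields one set of groups per pair:

```python
import re

text = "Geese (Flock) Dogs (Pack) Elephants (Herd) "

# A repeated capturing group: each iteration overwrites the previous
# capture, so only the values from the final iteration survive.
repeated = re.match(r"(?:\s*((\w+) (\(\w+\))))*", text)
last_only = repeated.groups()
print(last_only)  # ('Elephants (Herd)', 'Elephants', '(Herd)')

# One match per "Name (Group)" pair: the iteration happens outside the
# pattern, and every match carries its own three groups.
pairs = [m.groups() for m in re.finditer(r"((\w+) (\(\w+\)))", text)]
for pair in pairs:
    print(pair)
```

The iteration over matches is still a loop, just one provided by the regex library rather than written around the pattern; PCRE-based APIs typically offer an equivalent "match all" call.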
