
How are nested capturing groups numbered in regular expressions?

Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?

Consider the following PHP code (using PCRE regular expressions) and its output:

<?php
  $test_string = 'I want to test sub patterns';
  preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
  print_r($matches);
?>

Array
(
    [0] => I want to test sub patterns  //entire pattern
    [1] => I want to test           //entire outer parenthesis
    [2] => want             //first inner
    [3] => to               //second inner
    [4] => patterns             //next parentheses set
)

The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.

So, is this "capture the entire thing first" defined behavior in regular expression engines, or is it going to depend on the context of the pattern and/or the behavior of the engine (PCRE being different than C#'s being different than Java's being different than etc.)?

From perlrequick

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.

Caveat: opening parentheses of non-capturing constructs, such as non-capturing groups (?:...) and lookaheads (?=...), are not counted.
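As a quick cross-check in another engine, here is a minimal sketch using Python's re module (not one of the engines quoted above, so treat it as one extra data point rather than a statement about PCRE or Perl). It reuses the pattern from the question, then shows that (?:...) and (?=...) do not consume a group number.

```python
import re

test_string = "I want to test sub patterns"

# Same pattern as the PHP example: groups are numbered by opening parenthesis.
m = re.search(r"(I (want) (to) test) sub (patterns)", test_string)
print(m.group(0))   # the whole match
print(m.groups())   # groups 1..4, outer group first

# (?:...) and (?=...) start with a parenthesis but are not counted,
# so "want" becomes group 1 here.
n = re.search(r"(?:I (want) (to) test) sub (?=(patterns))", test_string)
print(n.groups())
```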

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

SUBPATTERNS

2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern

 the ((red|white) (king|queen)) 

the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.

Yeah, this is all pretty much well defined for all the languages you're interested in:

  • Java - http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
    "Capturing groups are numbered by counting their opening parentheses from left to right. ... Group zero always stands for the entire expression."
  • .Net - http://msdn.microsoft.com/en-us/library/bs2twtah(VS.71).aspx
    "Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern."
  • PHP (PCRE functions) - http://www.php.net/manual/en/function.preg-replace.php#function.preg-replace.parameters
    "\\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern." (It was also true of the deprecated POSIX functions)
  • PCRE - http://www.pcre.org/pcre.txt
    To add to what Alan M said, search for "How pcre_exec() returns captured substrings" and read the fifth paragraph that follows:

    The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set.
  • Perl's different - http://perldoc.perl.org/perlre.html#Capture-buffers
    $1, $2 etc. match capturing groups as you'd expect (i.e. by occurrence of opening bracket); however, $0 returns the program name, not the entire matched string. To get that, use $& instead.

You'll more than likely find similar results for other languages (Python, Ruby, and others).
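For Python, a small sketch along these lines shows the same layout. It uses the "the red king" pattern from the PCRE documentation quoted earlier; m.span(i) is only loosely analogous to the i-th offset pair in PCRE's ovector, and the comparison is an illustration, not a claim about how either engine is implemented.

```python
import re

m = re.search(r"the ((red|white) (king|queen))", "the red king")

# Index 0 is the whole match; indexes 1..n follow the opening parentheses.
spans = [m.span(i) for i in range(m.re.groups + 1)]
for i, (start, end) in enumerate(spans):
    print(i, (start, end), repr(m.string[start:end]))
```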

You say that it's equally logical to list the inner capture groups first, and you're right: it would just be a matter of indexing on closing, rather than opening, parens (if I understand you correctly). Doing this is less natural, though (for example, it doesn't follow reading-direction convention), and so makes it more difficult (probably not significantly) to determine, by inspection, which capturing group will be at a given result index.

Putting the entire matched string in position 0 also makes sense, mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number of capturing groups from regex to regex, and regardless of the number of capturing groups that actually match anything (some APIs shorten the array of matched groups when a capturing group does not match any content; think, for example, of something like "a (.*)pattern"). If the whole match were stored last instead, you could always inspect the last element of the results array, but that doesn't translate well to languages like Perl, which dynamically create variables ($1, $2, etc.). (Perl's a bad example, of course, since it uses $& for the matched expression, but you get the idea :)
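To make the "stable index" point concrete, here is a minimal Python sketch. The three-way alternation is a made-up pattern chosen so that only one group can participate in any match; how non-participating groups are reported differs between APIs, and this shows Python's behaviour only.

```python
import re

# Hypothetical pattern: exactly one of the three groups takes part in a match.
pattern = re.compile(r"(a)|(b)|(c)")
m = pattern.search("b")

print(m.group(0))   # the whole match is always at index 0
print(m.groups())   # non-participating groups are reported as None

# Group numbers do not shift when earlier groups fail to participate.
assert m.group(2) == "b"
assert m.group(1) is None and m.group(3) is None
```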

Every regex flavor I know numbers groups by the order in which the opening parentheses appear. That outer groups are numbered before their contained sub-groups is just a natural outcome, not explicit policy.
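The counting rule can be sketched as a small scanner. This is a simplified illustration, not how any engine is actually implemented: the function name capture_open_positions is hypothetical, and it assumes Python syntax, treating "(?P<name>" as capturing and every other "(?" construct as non-capturing. It handles backslash escapes and simple [...] classes but not edge cases such as a literal "]" placed first in a class.

```python
import re

def capture_open_positions(pattern):
    """Return the indexes of '(' characters that open a capturing group."""
    positions = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2              # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            # "(?..." is non-capturing, except the named form "(?P<name>".
            if not pattern.startswith("(?", i) or pattern.startswith("(?P<", i):
                positions.append(i)
        i += 1
    return positions

pattern = r"the ((red|white) (king|queen))"
opens = capture_open_positions(pattern)
# Group n is simply the n-th entry in this left-to-right list.
for number, index in enumerate(opens, start=1):
    print(number, index)
```

Because the list is built strictly left to right, an outer group always gets a lower number than anything nested inside it.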

Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens--the name is merely an alias for the number. However, in .NET regexes the named groups are numbered separately from numbered groups. For example:

Regex.Replace(@"one two three four", 
              @"(?<one>\w+) (\w+) (?<three>\w+) (\w+)",
              @"$1 $2 $3 $4")

// result: "two four one three"

In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there's a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:

^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$

If there's a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient, I think it more than justifies the weird numbering scheme.
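For contrast with the .NET numbering described above, here is a short Python sketch of the more common policy, where a name is just an alias for the positional number. It reuses the four-word example from the Regex.Replace call; Python is used here only as a representative of that policy, and it also rejects a duplicate group name at compile time.

```python
import re

m = re.search(r"(?P<one>\w+) (\w+) (?P<three>\w+) (\w+)", "one two three four")

print(m.groups())             # named and unnamed groups share one sequence
print(dict(m.re.groupindex))  # name -> group number
rearranged = m.expand(r"\1 \2 \3 \4")
print(rearranged)             # same word order as the input

# Reusing a group name is an error in Python's re module.
try:
    re.compile(r"(?P<sep>,)|(?P<sep>\.)")
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False
print(duplicate_allowed)
```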

And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D

On every platform I've used (perl, php, ruby, egrep), capturing in order of the opening parentheses is the standard.
