
How does this regex match into groups

I am looking at the regular expression ^\\s*(_?)(\\S+?)\\1\\s*$ from injector.js.

I understand how the string _non_ is matched. The first capturing group captures _, the second group captures non, and the back-reference \\1 to the first group then matches the trailing _. So the first group is _, the second group is non, and \\1 consumes the final _ (it is a back-reference, not a third group).

However, I do not understand how the strings _, _non and __ end up entirely in the second group, given that the \\1 back-reference would seem to require an _ at the end whenever there is an _ at the beginning.

Pattern: ^\\s*(_?)(\\S+?)\\1\\s*$

Overall, this pattern:

^ start at the beginning of the string

\\s* match 0 or more whitespace chars

(_?) match and capture 0 or 1 underscore (capture group 1)

(\\S+?) non-greedy match and capture 1 or more non-whitespace char (capture group 2)

\\1 match for what was matched in capture group 1

\\s* match 0 or more whitespace chars

$ match end of line/string

Subject: _

Group 1:

Group 2: _

Initially the _ is matched by the first capture group. But then the engine moves on to the 2nd capture group, which expects at least one char to match, so the engine backtracks and takes the char back from the first capture group: the ? in the first capture group makes it optional, and _ is a non-space char. Then, since group 1 ultimately captured the empty string (because group 2 had to be satisfied), the \\1 back-reference has nothing left to require and matches the empty string.

Subject: _non

Group 1:

Group 2: _non

Initially the _ is matched in group 1, then non is matched in group 2. Then the engine looks for a _ for that \\1 reference, and there is none, so the engine backtracks, removes the _ from group 1 and matches it in group 2 instead.

Subject: _non_

Group 1: _

Group 2: non

Similar to the previous: Initially the _ is matched in group 1, then non is matched in group 2. Then the engine looks for a _ for that \\1 reference, which it matches, so group 1 keeps its _ and group 2 just has non .

Subject: __

Group 1:

Group 2: __

This is essentially the same as the first _ example. Initially the first _ is matched in group 1. Then the 2nd _ is matched in group 2. Then \\1 tries to match another _ since group 1 captured one, but there is none. Group 2 requires at least 1 char but can take more, so the regex engine backs up and puts group 1's match into group 2.

Subject: _ _

Group 1:

Group 2:

This results in no match. The engine starts out putting the first _ into group 1, but then fails at putting the space in group 2. So it backs up and attempts to put the first _ into group 2. Since group 1 is now empty, \\1 has nothing to require and matches the empty string. The space is then matched by \\s*, but the match fails on the final _ because the pattern allows only whitespace before the end of the string.
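The five cases above can be checked directly. The following is a minimal sketch, assuming the pattern is written as a JavaScript regex literal with single backslashes (the doubled backslashes shown in the text are how it would look inside a string). The names ARG_PATTERN and groupsOf are made up for this illustration and are not taken from injector.js.

```javascript
// Pattern from the question, written as a regex literal.
// ARG_PATTERN is a hypothetical name used only in this sketch.
const ARG_PATTERN = /^\s*(_?)(\S+?)\1\s*$/;

// Returns what each capture group ended up holding,
// or null when the subject does not match at all.
function groupsOf(subject) {
  const m = ARG_PATTERN.exec(subject);
  return m === null ? null : { group1: m[1], group2: m[2] };
}

for (const subject of ["_", "_non", "_non_", "__", "_ _"]) {
  console.log(JSON.stringify(subject), "->", JSON.stringify(groupsOf(subject)));
}
// "_" -> {"group1":"","group2":"_"}
// "_non" -> {"group1":"","group2":"_non"}
// "_non_" -> {"group1":"_","group2":"non"}
// "__" -> {"group1":"","group2":"__"}
// "_ _" -> null
```

Note that group 1 always takes part in a successful match, so when it gives up its underscore it holds the empty string rather than being undefined.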

Sidenote

You asked in a comment:

if it matches the _ for the first group does it have to match an _ in the \\1 .Does \\1 it refer to the expression or the result of the expression?

It references the result of the expression (what is actually captured), not the expression itself.
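A small sketch of that difference, using two throwaway patterns that are not part of the original question: a back-reference has to repeat the text that was captured, whereas writing the sub-expression a second time would accept any text that sub-expression allows.

```javascript
// \1 must repeat the exact text that group 1 captured.
const repeatCaptured = /^(a|b)\1$/;
// Writing the sub-expression twice accepts any combination instead.
const repeatExpression = /^(a|b)(a|b)$/;

console.log(repeatCaptured.test("aa"));   // true
console.log(repeatCaptured.test("ab"));   // false
console.log(repeatExpression.test("ab")); // true

// In the pattern from the question, group 1 may capture the empty string,
// and \1 then matches the empty string as well.
const question = /^\s*(_?)(\S+?)\1\s*$/;
console.log(question.exec("_non")[1] === ""); // true
```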
