
Regular expressions - capturing groups confusion

I am reading an Oracle tutorial on regular expressions and am on the topic of capturing groups. The reference is excellent, but beyond the fact that a pair of parentheses represents a group, I am having a lot of difficulty understanding the topic. Here is what confuses me.

  1. What is the significance of counting groups in an expression?
  2. What are non-capturing groups?

Elaborating with examples would be nice.

  1. One usually doesn't count groups other than to know which group has which number. E.g. ([abc])([def](\d+)) has three groups, so I know to refer to them as \1 , \2 and \3 . Note that group 3 is inside group 2. Groups are numbered from the left by where their opening parenthesis appears.
  2. When you search with a regex to find something in a string (as opposed to matching, where you make sure the whole string matches the pattern), group 0 gives you just the matched text, not whatever came before or after it. Imagine a pair of parentheses around your whole regex. It isn't part of the total count because it isn't really considered a group.
  3. Groups can be used for things other than capturing. E.g. (foo|bar) will match "foo" or "bar". If you're not interested in the contents of a group, you can make it non-capturing, e.g. (?:foo|bar) (the syntax varies by dialect), so that it doesn't "use up" one of the numbers assigned to groups. But you don't have to; it's just convenient sometimes.
  4. Say I want to find a word that begins and ends with the same letter: \b([a-z])[a-z]*\1\b . The \1 will then match the same text that the first group captured. Of course it can be used for much more powerful stuff, but I think you get the idea.
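
To make the numbering concrete, here is a minimal Java sketch. It assumes the java.util.regex flavor from the Oracle tutorial; the class name GroupNumbering, the helper groups(), and the sample input are made up for illustration. It lists every group of a match, including group 0, and shows how a non-capturing group changes the numbering.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {
    // Hypothetical helper: returns the text of every group for the first
    // match found in the input (index 0 is the whole match), or null if
    // the pattern is not found anywhere.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        // groupCount() does not include group 0, so add one slot for it.
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three capturing groups; group 3 is nested inside group 2.
        String[] g = groups("([abc])([def](\\d+))", "xx ae42 yy");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + " = " + g[i]);
        }

        // With a non-capturing outer group, the digits become group 2.
        String[] h = groups("([abc])(?:[def](\\d+))", "xx ae42 yy");
        System.out.println("digits are now group 2: " + h[2]);
    }
}
```

Note that group 0 is "ae42" only, without the surrounding "xx " and " yy", because find() reports just the matched part of the input.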

(Coming up with relevant examples is certainly the hardest part.)

Edit: I answered when the questions were:

  1. What is the significance of counting groups in an expression?
  2. There is a special group, called group 0, which means the entire expression. It is not reported by the groupCount() method. Why is that?
  3. I don't understand what non-capturing groups are.
  4. Why do we need back-references? What is their significance?

Say you have a string, abcabc , and you want to figure out whether the first part of the string matches the second part. You can do this with a single regex by using capturing groups and backreferences. Here is the regex I would use:

(.+)\1

The way this works is that .+ matches any sequence of one or more characters. Because it is in parentheses, it is captured in a group. \1 is a backreference to the first capturing group, so it matches exactly the text that the group captured. After a bit of backtracking, the capturing group matches the first part of the string, abc . The backreference \1 now stands for abc , so it matches the second half of the string. The entire string is now matched, which confirms that the first half of the string is the same as the second half.
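
The same check can be sketched in Java, assuming java.util.regex; the class name RepeatedHalf and the method repeatedHalf() are hypothetical. In a Java string literal the backslash has to be doubled, so the pattern is written "(.+)\\1".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedHalf {
    // "(.+)\\1" in Java source is the regex (.+)\1
    private static final Pattern DOUBLED = Pattern.compile("(.+)\\1");

    // Returns the repeated half if the whole input is some text followed
    // by the same text again, otherwise null.
    static String repeatedHalf(String input) {
        Matcher m = DOUBLED.matcher(input);
        // matches() requires the entire input to match, not just a part.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatedHalf("abcabc")); // abc
        System.out.println(repeatedHalf("abcabd")); // null
    }
}
```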


Another use of backreferences is in replacing. Say you want to replace all {...} with [...] , if the text inside { and } is only digits. You can easily do this with capturing groups and backreferences, using the regex

{(\d+)}

and replace each match with [\1] .

The regex matches {123} in the string abc {123} 456 and captures 123 in the first capturing group. The backreference \1 now stands for 123 , so replacing {(\d+)} in abc {123} 456 with [\1] results in abc [123] 456 .
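
A rough Java version of this replacement, assuming java.util.regex, might look like the following. Two details differ from the generic notation above: Java requires the literal braces to be escaped in the pattern, and Java replacement strings refer to a captured group as $1 rather than \1. The class name BraceToBracket and the method convert() are made up for the example.

```java
public class BraceToBracket {
    // Replaces every {digits} with [digits]; other text is left untouched.
    static String convert(String input) {
        // Regex: \{(\d+)\}  -- braces escaped, digits captured as group 1.
        // Replacement: [$1] -- $1 inserts whatever group 1 captured.
        return input.replaceAll("\\{(\\d+)\\}", "[$1]");
    }

    public static void main(String[] args) {
        System.out.println(convert("abc {123} 456")); // abc [123] 456
        System.out.println(convert("abc {xyz} 456")); // abc {xyz} 456
    }
}
```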


Non-capturing groups exist because groups in general have more uses than just capturing. The regex (xyz)+ matches a string that consists entirely of the group, xyz , repeated, such as xyzxyzxyz . A group is needed because xyz+ only matches xy followed by z repeated, i.e. xyzzzzz . The problem with capturing groups is that they are slightly less efficient than non-capturing groups, and they take up an index. If you have a complicated regex with a lot of groups in it, but you only need to reference a single one somewhere in the middle, it's a lot better to make the others non-capturing and just reference \1 than to count all the groups up to the one you want.
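
Here is a small Java sketch of both points, assuming java.util.regex. The class GroupingVsCapturing, its helper methods, and the longer pattern with the digits in the middle are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupingVsCapturing {
    // True if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Number of capturing groups in the regex (group 0 is not counted).
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // The group makes + apply to "xyz" as a unit; without it, + applies to "z" only.
        System.out.println(matchesWhole("(xyz)+", "xyzxyzxyz"));   // true
        System.out.println(matchesWhole("xyz+", "xyzxyzxyz"));     // false
        System.out.println(matchesWhole("xyz+", "xyzzzzz"));       // true

        // Only the middle group captures, so the digits are simply group 1.
        String regex = "(?:xyz)+(\\d+)(?:abc)*";
        System.out.println(capturingGroups(regex));                // 1
        Matcher m = Pattern.compile(regex).matcher("xyzxyz42abc");
        if (m.matches()) {
            System.out.println(m.group(1));                        // 42
        }
    }
}
```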

I hope this helps!

  1. Can't think of an appropriate example at the moment, but I'm assuming someone might need to know the number of sub matches in the RegEx.
  2. Group 0 is always the entire base match. I'm assuming groupCount() just lets you know how many capture groups you've specified in the expression.
  3. A non-capturing group (?:) would be used to, well, not capture a group. For example, if you need to test whether a string contains one of several words and don't want to capture the word in a new group: (?:hello|hi there) world is not the same as hello|hi there world . The first matches "hello world" or "hi there world", but the second matches "hello" or "hi there world".
  4. They can be used as part of a multitude of powerful techniques, such as testing whether a number is prime or composite. :) Or you could simply test that a value isn't repeated, e.g. ^(\d)(?!.*\1)\d+$ ensures that the first digit does not appear again later in the string.
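
The last two points can be tried out with a short Java sketch, assuming java.util.regex; the class name AlternationAndLookahead and its methods are hypothetical. It uses whole-string matching, so "hello world" is rejected by the ungrouped alternation because neither alternative covers the entire input.

```java
import java.util.regex.Pattern;

public class AlternationAndLookahead {
    // The pattern from point 4: capture the first digit, then use a negative
    // lookahead to reject the input if that digit appears again later.
    private static final Pattern FIRST_DIGIT_UNIQUE =
            Pattern.compile("^(\\d)(?!.*\\1)\\d+$");

    // True if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    static boolean firstDigitUnique(String digits) {
        return FIRST_DIGIT_UNIQUE.matcher(digits).matches();
    }

    public static void main(String[] args) {
        // The group limits the alternation to "hello" or "hi there".
        System.out.println(matchesWhole("(?:hello|hi there) world", "hello world")); // true
        // Without the group, the alternatives are "hello" and "hi there world".
        System.out.println(matchesWhole("hello|hi there world", "hello world"));     // false

        System.out.println(firstDigitUnique("1234")); // true
        System.out.println(firstDigitUnique("1231")); // false
    }
}
```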

