
Java Regular Expression pattern match

I want to understand how the Java regular expression program below works. I cannot make sense of the second line of its output.

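import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        String pattern = "(.*)(\\d+)(.*)";

        // Compile the pattern and create a matcher for the input line.
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(line);
        if (m.find()) {
            System.out.println("Found value: " + m.group(0)); // entire match
            System.out.println("Found value: " + m.group(1)); // first (.*)
            System.out.println("Found value: " + m.group(2)); // (\\d+)
        }
    }
}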

This produces the following output:

Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT300
Found value: 0

As I understand it, the pattern looks for a number (\\d+) with anything before it (.*) and anything after it (.*). Please correct me if I am wrong here.

I understand that m.group(0) returns the whole string, but not the second line of the output, "Found value: This order was placed for QT300". What is happening here?

It's returning the text matched by the first capturing group, (.*). Because * is greedy by default, that group first consumes the whole line and then gives back just enough for (\\d+) to match a single digit, so it ends up holding everything before the last digit in the string.

Breaking it down:


m.group(0)  →  Entire match     →  (.*)(\\d+)(.*) // This order was placed for QT3000! OK?
m.group(1)  →  Capture Group 1  →  (.*)           // This order was placed for QT300
m.group(2)  →  Capture Group 2  →  (\\d+)         // 0
m.group(3)  →  Capture Group 3  →  (.*)           // ! OK?
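
To see that greediness is what produces this split, it may help to compare the original pattern with a reluctant first group, (.*?). The sketch below is only an illustration: the class name GreedyVsReluctant and the helper groups are made up for this example and are not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Hypothetical helper: returns groups 0..groupCount() of the first match,
    // or null if the pattern does not match anywhere in the input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        String[] greedy = groups("(.*)(\\d+)(.*)", line);     // pattern from the question
        String[] reluctant = groups("(.*?)(\\d+)(.*)", line); // first group made reluctant
        for (int i = 1; i <= 3; i++) {
            System.out.println("group " + i
                    + ": greedy=[" + greedy[i] + "]"
                    + " reluctant=[" + reluctant[i] + "]");
        }
    }
}
```

With the reluctant version, group 1 stops at "This order was placed for QT" and group 2 captures "3000", which is probably what the original program was meant to show.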

This happens because quantifiers are both greedy (they match as many characters as possible) and docile (they give characters back when the rest of the pattern needs them). From "Greedy... but Docile":

  • But if the quantified token has matched so many characters that the rest of the pattern can not match, the engine will backtrack to the quantified token and make it give up characters it matched earlier—one character or chunk at a time, depending on whether the quantifier applies to a single character or to a subpattern that can match chunks of several characters. After giving up each character or chunk, the engine tries once again to match the rest of the pattern. I call this behavior of greedy quantifiers docile.

That pretty much explains the output you got:

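  1. Group 0 : the entire match. Result: This order was placed for QT3000! OK?
  2. Group 1 : (.*) [Greedy] takes everything, then backtracks one character so that the following quantified token (\\d+) can match. Result: This order was placed for QT300
  3. Group 2 : (\\d+) [Greedy] needs at least one digit, and only the last one is left. Result: 0
  4. Group 3 : (.*) [Greedy] takes whatever remains. Result: ! OK?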
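
The boundaries in the list above can be checked with Matcher.start(int) and Matcher.end(int), which report where each group begins and ends in the input. This is a small sketch; the class name GroupBoundaries and the helper spans are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupBoundaries {
    // Hypothetical helper: for each group i of the first match, returns
    // {start, end} as reported by Matcher.start(i) and Matcher.end(i).
    static int[][] spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new int[0][];
        }
        int[][] out = new int[m.groupCount() + 1][];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = new int[] { m.start(i), m.end(i) };
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        int[][] spans = spans("(.*)(\\d+)(.*)", line);
        for (int i = 0; i < spans.length; i++) {
            int start = spans[i][0];
            int end = spans[i][1];
            System.out.println("group " + i + " covers [" + start + ", " + end + "): "
                    + line.substring(start, end));
        }
    }
}
```

Group 1 ends exactly where group 2 starts, one character before the "!": the first (.*) gave back only the single digit that (\\d+) needed.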

To see this more clearly, change the one-or-more quantifier (\\d+) to zero-or-more (\\d*): group 1 no longer has to give anything back, so its greedy (.*) takes the whole string and groups 2 and 3 match the empty string.
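
A quick way to try the \\d* variant is sketched below. The class name ZeroOrMoreDigits and the helper firstMatchGroups are placeholders chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroOrMoreDigits {
    // Hypothetical helper: returns groups 0..groupCount() of the first match,
    // or null if the pattern does not match.
    static String[] firstMatchGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        // \\d* may match zero digits, so the first (.*) never has to backtrack.
        String[] g = firstMatchGroups("(.*)(\\d*)(.*)", line);
        System.out.println("group 1: [" + g[1] + "]");
        System.out.println("group 2: [" + g[2] + "] (length " + g[2].length() + ")");
        System.out.println("group 3: [" + g[3] + "] (length " + g[3].length() + ")");
    }
}
```

Here groups 2 and 3 still participate in the match, so group(2) and group(3) return empty strings rather than null.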


 