Java regular expression boundary match?

I found the following question in a Java test suite:

    Pattern p = Pattern.compile("[wow]*");
    Matcher m = p.matcher("wow its cool");
    boolean b = false;
    while (b = m.find()) {
        System.out.print(m.start() + " \"" + m.group() + "\" ");
    }

where the output is as follows:

0 "wow" 3 "" 4 "" 5 "" 6 "" 7 "" 8 "" 9 "oo" 11 "" 12 ""

Up to the last match it is clear: the pattern [wow]* greedily matches 0 or more 'w' and 'o' characters, while for non-matching characters, including spaces, it produces empty strings. However, after the empty match at the last 'l' (11 ""), the following 12 "" is unclear to me. The test solution does not explain it, nor was I able to figure it out from the Javadoc. My best guess is that it is some kind of boundary match, but I would appreciate it if someone could provide an explanation.

The reason that you see this behavior is that your pattern allows empty matches. In other words, if you pass it an empty string, you would see a single match at position zero:

    Pattern p = Pattern.compile("[wow]*"); // One of the two 'w's is redundant, but the engine is OK with it
    Matcher m = p.matcher("");             // Passing an empty string results in a valid match that is empty
    boolean b = false;
    while (b = m.find()) {
        System.out.print(m.start() + " \"" + m.group() + "\" ");
    }

This would print 0 "" because an empty string is as good a match for the expression as any other.

Going back to your example: after a non-empty match, the engine resumes searching where that match ended; after an empty match, it steps forward by one character so that it does not report the same empty match forever. Either way, the engine then considers the "tail" of the string starting at the new position. This includes the moment after the empty match at position 11, i.e. at the very last character: the engine moves on to position 12, where the "tail" is an empty string. This is similar to calling "wow its cool".substring(12): you would get an empty string in that case as well.

The engine considers an empty string a valid input and tries to match it against your expression, as shown in the example above. This produces a match, which your program properly reports.
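To make the positions concrete, here is a minimal sketch that records where each match starts and ends. The class name EmptyMatchTrace and the helper spans are made up for illustration; only Pattern, Matcher, find(), start() and end() come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchTrace {
    // Hypothetical helper: records "start-end" for every match that find() reports.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "wow its cool";
        System.out.println(input.length());                 // 12
        System.out.println(input.substring(12).isEmpty());  // true: the "tail" at offset 12 is empty
        // Empty matches have start == end; the last entry is 12-12.
        System.out.println(spans("[wow]*", input));
    }
}
```

An entry such as 9-11 is the non-empty match "oo", while 11-11 and 12-12 are the two empty matches around the final 'l'.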

  • [wow]* matches the first wow string. Count = 1.

  • Because of the * (zero or more) after the character class, [wow]* also matches the empty string that sits before any character the class does not match. So it matches the empty string just before the first space. Count = 2.

  • None of the characters in its is matched by the regex, so it matches the empty string before each of them. Count = 2+3 = 5.

  • The second space is not matched either, so we get another empty match. Count = 5+1 = 6.

  • c is not matched, so the regex matches the empty string just before the c. Count = 6+1 = 7.

  • oo is matched by [wow]* and counts as a single match. Count = 7+1 = 8.

  • l is not matched, so there is an empty match just before it. Count = 8+1 = 9.

  • Finally, it matches the empty string that follows the last character. Count = 9+1 = 10.

  • m.start() prints the starting index of each of these matches.
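The tally in the list above can be checked with a short sketch. The class name MatchTally and the helper tally are hypothetical names chosen for this example; it simply counts how many times find() succeeds and how many of those matches are empty.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTally {
    // Hypothetical helper: returns {total matches, empty matches}.
    static int[] tally(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int total = 0;
        int empty = 0;
        while (m.find()) {
            total++;
            if (m.group().isEmpty()) {
                empty++;
            }
        }
        return new int[] {total, empty};
    }

    public static void main(String[] args) {
        int[] t = tally("[wow]*", "wow its cool");
        // Two non-empty matches ("wow" and "oo") plus eight empty ones.
        System.out.println("total=" + t[0] + " empty=" + t[1]); // total=10 empty=8
    }
}
```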

The regex is simply matching the pattern against the input, starting at a given offset. For the last match, the offset of 12 is at the point after the last character of 'cool' - you might think this is the end of the string and therefore cannot be used for matching purposes - but you'd be wrong. For pattern-matching, this is a perfectly valid starting point.

As you state, your regular expression allows a match of zero characters, and indeed this is what happens at the position after the last character, which is the same position that the end-of-input anchor (usually written $ in a regular expression) matches.

To put it another way, without testing past the end of the last character, it would mean no matches would ever occur relating to the end of the string - but there are many regex constructs that match the end of the string (and you've shown one of them here).
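As a rough illustration of that last point, the sketch below lists the offsets at which two end-of-input constructs match the same input. The class name EndOffsetDemo and the helper starts are invented for this example; $ and \z are standard java.util.regex constructs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndOffsetDemo {
    // Hypothetical helper: collects the start offset of every match.
    static List<Integer> starts(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "wow its cool"; // length 12, so offset 12 is just past the final 'l'
        System.out.println(starts("$", input));   // [12]
        System.out.println(starts("\\z", input)); // [12]
    }
}
```

Both anchors match only at offset 12, which is the same offset at which [wow]* reports its final empty match.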
