
Java regular expression boundary match?

I found the following question in a Java test suite:

    Pattern p = Pattern.compile("[wow]*");
    Matcher m = p.matcher("wow its cool");
    boolean b = false;
    while (b = m.find()) {
        System.out.print(m.start() + " \"" + m.group() + "\" ");
    }

where the output seems to be as follows:

0 "wow" 3 "" 4 "" 5 "" 6 "" 7 "" 8 "" 9 "oo" 11 "" 12 ""

Up to the last match it is clear: the pattern [wow]* greedily matches zero or more 'w' and 'o' characters, while for non-matching characters, including spaces, it yields empty strings. However, after the empty match at the final 'l' (11 ""), the following 12 "" is unclear. The test solution gives no details, nor was I able to work it out from the Javadoc. My best guess is that it is some kind of boundary character, but I would appreciate it if someone could provide an explanation.

The reason you see this behavior is that your pattern allows empty matches. In other words, if you pass it an empty string, you will see a single match at position zero:

    Pattern p = Pattern.compile("[wow]*"); // One of the two 'w's is redundant, but the engine is OK with it
    Matcher m = p.matcher("");             // Passing an empty string results in a valid match that is empty
    boolean b = false;
    while (b = m.find()) {
        System.out.print(m.start() + " \"" + m.group() + "\" ");
    }

This prints 0 "" because an empty string is as good a match for the expression as any other.

Going back to your example: every time the engine finds a match, it resumes the search where that match ended, and after an empty match it first steps forward by one character so that it does not report the same empty match again. "Advancing by one" means that the engine considers the "tail" of the string at the next position. This also applies after the empty match at position 11, i.e. at the very last character: the engine moves on to position 12, where the "tail" is an empty string. This is similar to calling "wow its cool".substring(12): you would get an empty string in that case as well.

The engine considers an empty string a valid input and tries to match it against your expression, as shown in the example above. This produces a match, which your program properly reports.
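To make the positions concrete, here is a minimal sketch that records where each match starts and ends. The class name EmptyMatchTrace and the helper trace are illustrative names, not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchTrace {
    // Hypothetical helper: one "start-end" entry per match found by find().
    static List<String> trace(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        String input = "wow its cool";
        System.out.println(input.length());                 // 12
        System.out.println(input.substring(12).isEmpty());  // true
        // [0-3, 3-3, 4-4, 5-5, 6-6, 7-7, 8-8, 9-11, 11-11, 12-12]
        System.out.println(trace("[wow]*", input));
        // [0-0]
        System.out.println(trace("[wow]*", ""));
    }
}
```

An entry such as 3-3 is an empty match (start equals end). The last entry, 12-12, is the empty match against the empty tail at the end of the input.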

  • [wow]* matches the first wow string. Count = 1.

  • Because of the * (zero or more) after the character class, [wow]* also matches the empty string that sits just before any character the class does not match. So it matches the empty string just before the first space. Count = 2.

  • None of the characters in its is matched by the regex, so it matches the empty string before each of them. Count = 2+3 = 5.

  • The second space is not matched either, so we get another empty match. Count = 5+1 = 6.

  • c is not matched, so the regex matches the empty string just before it. Count = 6+1 = 7.

  • oo is matched by [wow]* and counts as a single match. Count = 7+1 = 8.

  • l is not matched, giving an empty match just before it. Count = 9.

  • Finally, it matches the empty string after the last character. Count = 9+1 = 10.

  • m.start() prints the starting index of each of these matches.
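The running count can be checked with a short sketch. The class name MatchCount and the helpers starts and emptyMatches are illustrative names introduced here, not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCount {
    // Hypothetical helper: start index of every match, empty or not.
    static List<Integer> starts(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    // Hypothetical helper: number of matches whose matched text is empty.
    static int emptyMatches(String regex, String input) {
        int count = 0;
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.group().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> s = starts("[wow]*", "wow its cool");
        System.out.println(s);         // [0, 3, 4, 5, 6, 7, 8, 9, 11, 12]
        System.out.println(s.size());  // 10
        System.out.println(emptyMatches("[wow]*", "wow its cool")); // 8
    }
}
```

Of the ten matches, two are non-empty (wow and oo) and eight are empty, which agrees with the tally above.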


The regex engine simply matches the pattern against the input, starting at a given offset. For the last match, the offset of 12 is the point just after the last character of 'cool'. You might think this is the end of the string and therefore cannot be used for matching purposes, but you'd be wrong. For pattern matching, this is a perfectly valid starting point.

As you state, your regular expression allows a match of zero characters, and indeed this is what happens after the last character but before the end-of-string marker (usually represented by $ in a regular expression).

To put it another way, if the engine did not test the position past the last character, no match involving the end of the string could ever occur. Yet there are many regex constructs that match at the end of the string (and you've shown one of them here).
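As a small illustration of the end of the input being a valid match position, the sketch below asks where a few end-anchored patterns first match. The class name EndOfInputDemo and the helper firstStart are illustrative names; the patterns are chosen here as examples and do not come from the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndOfInputDemo {
    // Hypothetical helper: start index of the first match, or -1 if none.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "wow its cool";
        System.out.println(firstStart("$", input));      // 12
        System.out.println(firstStart("\\z", input));    // 12
        System.out.println(firstStart("cool$", input));  // 8
    }
}
```

Both $ and \z match a zero-length position at offset 12, the same offset at which [wow]* reports its final empty match.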
