Java Matcher groups: Understanding the difference between "(?:X|Y)" and "(?:X)|(?:Y)"

Can anyone explain:

  1. Why do the two patterns used below give different results? (answered below)
  2. Why does the 2nd example give a group count of 1 but report the start and end of group 1 as -1?

 // Required imports for the enclosing test class
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 public void testGroups() throws Exception
 {
  String TEST_STRING = "After Yes is group 1 End";
  {
   // Pattern 1: the alternation is confined to the non-capturing group
   String pattern="(?:Yes|No)(.*)End";
   Pattern p=Pattern.compile(pattern);
   Matcher m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();
   // start(1)/end(1) return -1 if group 1 did not take part in the match
   int start=m.start(1);
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count +
     " Start of group 1=" + start + " End of group 1=" + end );
  }

  {
   // Pattern 2: the alternation spans the whole expression
   String pattern="(?:Yes)|(?:No)(.*)End";
   Pattern p=Pattern.compile(pattern);
   Matcher m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();
   int start=m.start(1);
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count +
     " Start of group 1=" + start + " End of group 1=" + end );
  }

 }

Which gives the following output:

Pattern=(?:Yes|No)(.*)End  Found=true Group count=1 Start of group 1=9 End of group 1=21
Pattern=(?:Yes)|(?:No)(.*)End  Found=true Group count=1 Start of group 1=-1 End of group 1=-1

  1. The difference is that in the second pattern "(?:Yes)|(?:No)(.*)End", concatenation ("X followed by Y" in "XY") has higher precedence than choice ("Either X or Y" in "X|Y"), just as multiplication has higher precedence than addition, so the pattern is equivalent to

     "(?:Yes)|(?:(?:No)(.*)End)"

    What you wanted is the following pattern:

     "(?:(?:Yes)|(?:No))(.*)End"

    This yields the same output as your first pattern.

    In your test, the second pattern reports group 1 at the (empty) range [-1, -1) because that group did not match (the start -1 is included, the end -1 is excluded, making the half-open interval empty).

  2. A capturing group is a group that may capture input. If it captures, one also says it matches some substring of the input. If the regex contains choices, then not every capturing group necessarily captures input, so there may be groups that do not match even if the regex as a whole matches.

  3. The group count, as returned by Matcher.groupCount(), is obtained purely by counting the parentheses of capturing groups, irrespective of whether any of them could match on any given input. Your pattern has exactly one capturing group: (.*). This is group 1. The documentation states:

     (?:X) X, as a non-capturing group 

    and explains:

    Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing group.

    Whether any specific group matches on a given input is irrelevant to that definition. E.g., in the pattern (Yes)|(No), there are two groups ((Yes) is group 1, (No) is group 2), but only one of them can match for any given input.

  4. The call to Matcher.find() returns true if the regex matched some substring. You can determine which groups matched by looking at their start: if it is -1, the group did not match. In that case, the end is -1, too. There is no built-in method that tells you how many capturing groups actually matched after a call to find() or matches(). You'd have to count these yourself by looking at each group's start.

  5. When it comes to backreferences, also note what the regex tutorial has to say:

    There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that did not participate in the match at all.
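
The points above about group count versus group participation can be checked with a small sketch. The class name GroupParticipationDemo and the helper describe are made up for this illustration; only the java.util.regex calls are from the standard library. The last two patterns are additional examples, not from the question, chosen to contrast a group that matched the empty string with a group that did not participate at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupParticipationDemo {

    /** Hypothetical helper: describes every capturing group of the first match. */
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder("groupCount=" + m.groupCount());
        for (int i = 1; i <= m.groupCount(); i++) {
            // group(i) is null and start(i)/end(i) are -1 when group i did not participate
            sb.append(" | group ").append(i).append('=').append(m.group(i))
              .append(" start=").append(m.start(i))
              .append(" end=").append(m.end(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two capturing groups, but only one can participate in any match
        System.out.println(describe("(Yes)|(No)", "Yes"));
        // Group 1 participates and captures the empty string
        System.out.println(describe("(a*)b", "b"));
        // Group 1 is optional and does not participate at all
        System.out.println(describe("(a)?b", "b"));
    }
}
```

In the second call group 1 is reported as an empty string with start and end 0, whereas in the third call it is null with start and end -1, which is the distinction the tutorial quote refers to.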

To summarise,

1) The two patterns give different results because of the precedence rules of the operators.

  • (?:Yes|No)(.*)End matches (Yes or No) followed by .*End
  • (?:Yes)|(?:No)(.*)End matches (Yes) or (No followed by .*End)
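
One way to see the two readings side by side is to print the text each pattern actually matches. This is only a sketch: the class name PrecedenceDemo, the helper firstMatch, and the second input string (with "No" in place of "Yes") are assumptions added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecedenceDemo {

    /** Hypothetical helper: text of the first match of regex in input, or null if none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String grouped = "(?:Yes|No)(.*)End";       // (Yes or No) followed by .*End
        String ungrouped = "(?:Yes)|(?:No)(.*)End"; // (Yes) or (No followed by .*End)
        String[] inputs = {"After Yes is group 1 End", "After No is group 1 End"};
        for (String input : inputs) {
            System.out.println(input);
            System.out.println("  " + grouped + " -> [" + firstMatch(grouped, input) + "]");
            System.out.println("  " + ungrouped + " -> [" + firstMatch(ungrouped, input) + "]");
        }
    }
}
```

For the "Yes" input the second pattern matches only "Yes", because the first alternative succeeds on its own. For the "No" input both patterns match "No is group 1 End", since the second alternative carries the (.*)End tail.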

2) The second pattern gives a group count of 1 but a start and end of -1 because of the (not necessarily intuitive) meanings of the results returned by the Matcher method calls.

  • Matcher.find() returns true if a match was found. In your case the match was on the (?:Yes) part of the pattern.
  • Matcher.groupCount() returns the number of capturing groups in the pattern regardless of whether the capturing groups actually participated in the match. In your case only the non-capturing (?:Yes) part of the pattern participated in the match, but the capturing (.*) group was still part of the pattern, so the group count is 1.
  • Matcher.start(n) and Matcher.end(n) return the start and end index of the subsequence matched by the nth capturing group. In your case, although an overall match was found, the (.*) capturing group did not participate in the match and so did not capture a subsequence, hence the -1 results.

3) (Question asked in comment.) In order to determine how many capturing groups actually captured a subsequence, iterate Matcher.start(n) from 0 to Matcher.groupCount(), counting the number of non -1 results. (Note that group 0 stands for the whole pattern and is not included in groupCount(), so you may want to exclude Matcher.start(0) for your purposes.)
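
The counting loop described in 3) might look like the following sketch. The class name ParticipatingGroups and both countParticipating methods are hypothetical helpers, not part of java.util.regex. The loop starts at 1 on the assumption that group 0 (the whole match) should be excluded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParticipatingGroups {

    /**
     * Counts the capturing groups (1..groupCount) that participated in the
     * current match. Must be called after a successful find() or matches(),
     * otherwise start(i) throws IllegalStateException.
     */
    static int countParticipating(Matcher m) {
        int n = 0;
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.start(i) != -1) {
                n++;
            }
        }
        return n;
    }

    /** Convenience wrapper: uses the first match of regex in input; -1 if there is none. */
    static int countParticipating(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? countParticipating(m) : -1;
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";
        System.out.println(countParticipating("(?:Yes|No)(.*)End", input));     // 1
        System.out.println(countParticipating("(?:Yes)|(?:No)(.*)End", input)); // 0
    }
}
```

Starting the loop at 0 instead would add one to every result for a successful match, because group 0 always participates.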

Due to the precedence of the "|" operator, the second pattern is equivalent to:

(?:Yes)|(?:(?:No)(.*)End)

What you want is:

(?:(?:Yes)|(?:No))(.*)End
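
These equivalences can be checked with a rough sketch like the one below. The class name EquivalenceCheck and the helpers summary and groupCount are invented for this example. It also shows a side effect worth keeping in mind: if the explicit grouping is written with capturing parentheses, as in (?:Yes)|((?:No)(.*)End), the pattern still matches the same text but has two capturing groups, and (.*) becomes group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EquivalenceCheck {

    /** Hypothetical helper: number of capturing groups declared in regex. */
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    /** Hypothetical helper: matched text plus the span of group 1 for the first match. */
    static String summary(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return "match=[" + m.group() + "] group1=[" + m.start(1) + "," + m.end(1) + ")";
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";
        // The second pattern and its explicitly grouped form behave the same
        System.out.println(summary("(?:Yes)|(?:No)(.*)End", input));
        System.out.println(summary("(?:Yes)|(?:(?:No)(.*)End)", input));
        // The first pattern and the corrected rewrite behave the same
        System.out.println(summary("(?:Yes|No)(.*)End", input));
        System.out.println(summary("(?:(?:Yes)|(?:No))(.*)End", input));
        // Capturing parentheses around the second alternative add a group
        System.out.println(groupCount("(?:Yes)|(?:(?:No)(.*)End)")); // 1
        System.out.println(groupCount("(?:Yes)|((?:No)(.*)End)"));   // 2
    }
}
```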

When using regular expressions, it is important to remember that there is an implicit AND operator (concatenation) at work. This can be seen in the JavaDoc for java.util.regex.Pattern, which lists the logical operators:

Logical operators
XY    X followed by Y
X|Y   Either X or Y
(X)   X, as a capturing group

This AND takes precedence over the OR in the second Pattern. The second Pattern is equivalent to
(?:Yes)|(?:(?:No)(.*)End)
In order for it to be equivalent to the first Pattern, it must be changed to
(?:(?:Yes)|(?:No))(.*)End
