Capturing groups and Pattern split method in regular expression

How can I understand the output of the code below? The first four print statements are about capturing groups in Java regular expressions, and the rest of the code is about the Pattern split method. I read a few documents to make sense of the output (shown in the picture) but could not figure out exactly how the code works and why it produces this output.

Java Code

    import java.util.*;
    import java.util.regex.*;
    import java.lang.*;
    import java.io.*;

    /* Name of the class has to be "Main" only if the class is public. */
    public class Codechef
    {
        public static void main(String[] args) {
            //Capturing Group in Regular Expression
            System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
            System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
            System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
            System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false
            // using pattern split method
            Pattern pattern = Pattern.compile("\\W");
            String[] words = pattern.split("one@two#three:four$five");
            System.out.println(words); // prints the array's type and hash code, not its elements
            for (String s : words) {
                System.out.println("Split using Pattern.split(): " + s);
            }

        }
    }

Results

[Image: console output of the program]

Edit-1

Queries

  • On capturing groups: I cannot figure out what the '\\1' and '\\2' are for here. How do these expressions evaluate to true or false?
  • On the Pattern split method: I would like to know how the string is being split. How does this split method work differently from a normal string split method?

The first four console print lines...

System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true
System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false

use the matches() method, which always returns a boolean (true or false) and succeeds only if the entire input string matches the expression. This method is mostly used for String validation of one sort or another. The first and second examples both use the regular expression "(\\w\\d)\\1". Run against the two supplied strings ("a2a2" and "a2b2") through the matches() method, as they are here, it returns true and false, in that order.

The real key here is knowing what that particular regular expression is supposed to validate. The expression above uses just one capturing group, which is denoted by the parentheses. The \\w matches any single word character: a-z, A-Z, 0-9 or _ (the underscore character). The \\d matches a single digit from 0 to 9.

Note: In the regular expression itself these metacharacters are written as \w and \d, but because the escape character ( \ ) must itself be escaped inside a Java String literal, you have to add an additional backslash.
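The escaping point in the note can be checked directly. This is only a minimal sketch (the class name EscapeDemo and the constant REGEX are made up for illustration): it prints the regex as the engine receives it, after the Java compiler has processed the String literal.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The Java source literal "\\w\\d" holds the 4-character regex \w\d.
    static final String REGEX = "\\w\\d";

    public static void main(String[] args) {
        System.out.println(REGEX);                         // the regex as the engine sees it: \w\d
        System.out.println(REGEX.length());                // 4
        System.out.println(Pattern.matches(REGEX, "a2"));  // true
    }
}
```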

The \\1 is a backreference: at that position the input must contain the same text that was most recently matched by the 1st capturing group. Since only one capturing group is defined, 1 is the only group number that makes sense here. A 0 is not an option, because in a Java regular expression \0 introduces an octal escape rather than a backreference. A number greater than 1 would refer to a group that does not exist, so the backreference could never match.

Bottom line: the expression checks the supplied string as follows:

  • Is the first character ( \\w ) an upper or lower case letter A to Z, an underscore, or a digit from 0 to 9? If it isn't, there is no match and boolean false is returned, but if it is...
  • Is the second character ( \\d ) a digit from 0 to 9? If it isn't, boolean false is returned, but if it is...
  • Are the remaining 2 characters exactly the same as the first 2 (including letter case if a-z or A-Z are used)? If they are not identical, or if there are more than two remaining characters, boolean false is returned. If they are identical, boolean true is returned.
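The three checks above can be written out by hand to see what the regex is doing. The following is a rough sketch, not how the regex engine is implemented; ManualCheck, isWordChar and manualMatch are hypothetical names, and it assumes Java's default (ASCII-only) meaning of \w and \d.

```java
import java.util.regex.Pattern;

class ManualCheck {
    // ASCII word character, the default meaning of \w in java.util.regex.
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // Hand-written equivalent of Pattern.matches("(\\w\\d)\\1", s).
    static boolean manualMatch(String s) {
        if (s.length() != 4) return false;                // matches() must consume the whole input
        if (!isWordChar(s.charAt(0))) return false;       // \w
        char d = s.charAt(1);
        if (d < '0' || d > '9') return false;             // \d
        return s.substring(2).equals(s.substring(0, 2));  // \1 repeats group 1 exactly
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a2a2", "a2b2", "a2A2", "a2a2x"}) {
            System.out.println(s + " -> manual=" + manualMatch(s)
                    + ", regex=" + Pattern.matches("(\\w\\d)\\1", s));
        }
    }
}
```

Only "a2a2" passes both checks; "a2A2" fails because the comparison is case sensitive, and "a2a2x" fails because of the extra trailing character.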

Basically, the expression validates that the last two characters of the supplied String repeat its first two characters (a word character followed by a digit). This is why the second console print:

System.out.println(Pattern.matches("(\\w\\d)\\1", "a2b2")); //false

returns a boolean false: b2 is not the same as a2. In the first console print:

System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true

the last two characters a2 do indeed match the first two characters a2, and therefore boolean true is returned.

You will now notice that in the other two console prints:

System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B2AB")); //true
System.out.println(Pattern.matches("(AB)(B\\d)\\2\\1", "ABB2B3AB")); //false

the regular expression used contains 2 capturing groups (two sets of parentheses). The same sort of matching applies here, but against two capturing groups instead of one: (AB) captures AB, (B\\d) captures B followed by a digit, \\2 must then repeat the second capture and \\1 the first. "ABB2B2AB" satisfies this, whereas in "ABB2B3AB" the text B3 does not repeat the captured B2.
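To see what each group actually captured, a Matcher can be used instead of the static Pattern.matches() call. This is a small sketch under that assumption; the class TwoGroupDemo and the helper groups() are made-up names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoGroupDemo {
    static final Pattern P = Pattern.compile("(AB)(B\\d)\\2\\1");

    // Returns "group1,group2" when the whole input matches, otherwise null.
    static String groups(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) + "," + m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(groups("ABB2B2AB")); // AB,B2
        System.out.println(groups("ABB2B3AB")); // null: B3 does not repeat the captured B2
    }
}
```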

If you want to see how these regular expressions play out and get explanations of what the expressions mean, use the Regular Expression Tester at regex101.com. It is also a good regular expressions resource.

Pattern.split():

In this case, the use of the Pattern.split() method is a little overkill in my opinion, since String.split() also accepts regular expressions, but it does have its purpose in other areas. Nevertheless, it is a good example of how it can be used. The .split() method is used here to break up the String that was supplied to it, using as the delimiter the regular expression compiled through Pattern, which in this case is "\\W" (the regex \W). The \W (uppercase W) means 'match any single non-word character', that is, any character other than a-z, A-Z, 0-9 or _. This expression is basically the opposite of \w (with the lowercase w). The characters @, #, :, and $ (and likewise the comma, semicolon, exclamation mark, etc.) contained within the supplied String:

"one@two#three:four$five"

are considered non-word characters, and therefore the split is carried out on each of them, resulting in a String array containing:

[one, two, three, four, five]
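The array contents shown above can be reproduced with a short sketch like the following. SplitDemo and splitOnNonWord are hypothetical names; Arrays.toString is used because printing a String[] directly shows only its type and hash code.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Splits the input on every single non-word character (\W).
    static String[] splitOnNonWord(String input) {
        return Pattern.compile("\\W").split(input);
    }

    public static void main(String[] args) {
        String[] words = splitOnNonWord("one@two#three:four$five");
        System.out.println(Arrays.toString(words)); // [one, two, three, four, five]
    }
}
```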

The very same thing can be accomplished using the String.split() method, since this method also allows a regular expression to be applied:

String[] s = "one@two#three:four$five".split("\\W");

or even:

String[] s = "one@two#three:four$five".split("[@#:$]");

or even:

String[] s = "one@two#three:four$five".split("@|#|:|\\$");
// The $ character is a reserved RegEx symbol and therefore
// needs to be escaped.

or on and on and on...

Yup... "\\W" is easier since it covers all non-word characters. ;)
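As a quick check that these String.split() variants agree, here is a sketch that runs all three against the question's input (with the colon between three and four). SplitVariants and allAgree are made-up names for illustration.

```java
import java.util.Arrays;

class SplitVariants {
    static final String INPUT = "one@two#three:four$five";

    // True when the three delimiter expressions produce the same pieces.
    static boolean allAgree() {
        String[] a = INPUT.split("\\W");        // any non-word character
        String[] b = INPUT.split("[@#:$]");     // character class; $ is literal inside [...]
        String[] c = INPUT.split("@|#|:|\\$");  // alternation; $ must be escaped here
        return Arrays.equals(a, b) && Arrays.equals(b, c);
    }

    public static void main(String[] args) {
        System.out.println(allAgree());                            // true
        System.out.println(Arrays.toString(INPUT.split("\\W")));   // [one, two, three, four, five]
    }
}
```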

If I talk about capturing groups, I cannot figure out what the usage of '\\1' or '\\2' is here. How are these evaluating to true or false?

Answer:

  • \\1 repeats the first captured group (i.e. a2 captured by (\\w\\d) )
  • \\2 repeats the second captured group (i.e. B2 captured by (B\\d) )

The actual name for those combinations is backreferences:

The section of the input string matching the capturing group(s) is saved in memory for later recall via backreference. A backreference is specified in the regular expression as a backslash (\) followed by a digit indicating the number of the group to be recalled.
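The "saved in memory" part can be made visible with Matcher.group(). A minimal sketch, assuming the single-group expression from the question; BackrefDemo and capturedGroup are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Returns the text saved by group 1 when the whole input matches, otherwise null.
    static String capturedGroup(String input) {
        Matcher m = Pattern.compile("(\\w\\d)\\1").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(capturedGroup("a2a2")); // a2 (the text that \1 had to repeat)
        System.out.println(capturedGroup("a2b2")); // null (b2 does not repeat a2)
    }
}
```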


If I talk about the Pattern split method, I wish to know how the string split is happening. How does this split method work differently from a normal string split method?

Answer:

The split() method in the Pattern class splits a text into an array of Strings, using the regular expression (the pattern) as the delimiter.

Rather than splitting on a fixed string or character, you provide a regex, which is much more powerful and flexible. (Java's String.split() takes a regex too; with Pattern you compile the expression once and can reuse it.)
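One practical way to use Pattern.split() is sketched below: the pattern is compiled once and reused for several inputs, and the two-argument overload caps the number of pieces. ReuseDemo and NON_WORD are made-up names, and the sample inputs are invented for the example. Pattern.quote() is shown as one way to get fixed-string splitting when that is what you actually want.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class ReuseDemo {
    // Compiled once, reused for every input.
    static final Pattern NON_WORD = Pattern.compile("\\W");

    public static void main(String[] args) {
        for (String line : new String[] {"one@two#three", "four$five"}) {
            System.out.println(Arrays.toString(NON_WORD.split(line)));
        }
        // Limit of 2: at most two pieces, the remainder is left unsplit.
        System.out.println(Arrays.toString(NON_WORD.split("a-b-c", 2)));         // [a, b-c]
        // Pattern.quote makes the delimiter literal, so "." is not "any character".
        System.out.println(Arrays.toString("a.b.c".split(Pattern.quote("."))));  // [a, b, c]
    }
}
```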
