
Match multiline text using regular expression

I am trying to match multi-line text using Java. When I use the Pattern class with the Pattern.MULTILINE modifier, I am able to match, but I am not able to do so with (?m).

The same pattern with (?m) and using String.matches does not seem to work.

I am sure I am missing something, but I have no idea what. I am not very good at regular expressions.

This is what I tried:

String test = "User Comments: This is \t a\ta \n test \n\n message \n";

String pattern1 = "User Comments: (\\W)*(\\S)*";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
System.out.println(p.matcher(test).find());  //true

String pattern2 = "(?m)User Comments: (\\W)*(\\S)*";
System.out.println(test.matches(pattern2));  //false - why?

First, you're using the modifiers under an incorrect assumption.

Pattern.MULTILINE or (?m) tells Java to let the anchors ^ and $ match at the start and end of each line (otherwise they only match at the start/end of the entire string).

Pattern.DOTALL or (?s) tells Java to allow the dot to match newline characters, too.
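As a rough illustration of the two flags, here is a minimal sketch. The class name FlagDemo, the helper firstMatch, and the sample input are made up for this example; only Pattern, Matcher, and the two flag constants come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "first\nsecond";

        // Without DOTALL the dot stops at the line break.
        System.out.println(firstMatch("first.*", 0, input)); // first
        // With DOTALL the dot also consumes "\n", so the whole input is matched.
        System.out.println(input.equals(firstMatch("first.*", Pattern.DOTALL, input))); // true

        // Without MULTILINE, ^ only matches at the very start of the input.
        System.out.println(firstMatch("^second", 0, input)); // null
        // With MULTILINE, ^ also matches right after the line break.
        System.out.println(firstMatch("^second", Pattern.MULTILINE, input)); // second
    }
}
```

Note that neither flag changes whether a pattern has to cover the whole input; that depends only on which matching method is called.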

Second, in your case, the regex fails because you're using the matches() method, which expects the regex to match the entire string - which of course doesn't work since there are some characters left after (\\W)*(\\S)* have matched.

So if you're simply looking for a string that starts with User Comments:, use the regex

^\s*User Comments:\s*(.*)

with the Pattern.DOTALL option:

Pattern regex = Pattern.compile("^\\s*User Comments:\\s*(.*)", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
    ResultString = regexMatcher.group(1);
} 

ResultString will then contain the text after User Comments:

This has nothing to do with the MULTILINE flag; what you're seeing is the difference between the find() and matches() methods. find() succeeds if a match can be found anywhere in the target string, while matches() expects the regex to match the entire string.

Pattern p = Pattern.compile("xyz");

Matcher m = p.matcher("123xyzabc");
System.out.println(m.find());    // true
System.out.println(m.matches()); // false

m = p.matcher("xyz"); // reuse m; declaring it a second time would not compile
System.out.println(m.matches()); // true

Furthermore, MULTILINE doesn't mean what you think it does. Many people seem to jump to the conclusion that you have to use that flag if your target string contains newlines, that is, if it contains multiple logical lines. I've seen several answers here on SO to that effect, but in fact, all that flag does is change the behavior of the anchors, ^ and $.

Normally ^ matches the very beginning of the target string, and $ matches the very end (or before a newline at the end, but we'll leave that aside for now). But if the string contains newlines, you can choose for ^ and $ to match at the start and end of any logical line, not just the start and end of the whole string, by setting the MULTILINE flag.
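The anchor behavior described above can be observed directly. This is only a sketch: the class AnchorDemo, the helper allMatches, and the three-line input are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorDemo {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> allMatches(String regex, int flags, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree\n";

        // Default mode: ^ and $ refer to the whole input, so no single word fits between them.
        System.out.println(allMatches("^\\w+$", 0, input)); // []

        // MULTILINE: ^ and $ match at the start and end of every logical line.
        System.out.println(allMatches("^\\w+$", Pattern.MULTILINE, input)); // [one, two, three]

        // Default mode again: $ still matches just before the final newline.
        System.out.println(allMatches("\\w+$", 0, input)); // [three]
    }
}
```

The last call shows the "before a newline at the end" case mentioned in passing: even without MULTILINE, $ accepts a position just before a trailing line terminator.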

So forget about what MULTILINE means and just remember what it does: changes the behavior of the ^ and $ anchors. DOTALL mode was originally called "single-line" (and still is in some flavors, including Perl and .NET), and it has always caused similar confusion. We're fortunate that the Java devs went with the more descriptive name in that case, but there was no reasonable alternative for "multiline" mode.

In Perl, where all this madness started, they've admitted their mistake and gotten rid of both "multiline" and "single-line" modes in Perl 6 regexes. In another twenty years, maybe the rest of the world will have followed suit.

str.matches(regex) behaves like Pattern.matches(regex, str), which attempts to match the entire input sequence against the pattern and returns

true if, and only if, the entire input sequence matches this matcher's pattern

Whereas matcher.find() attempts to find the next subsequence of the input sequence that matches the pattern and returns

true if, and only if, a subsequence of the input sequence matches this matcher's pattern
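A small sketch of these two contracts side by side. The class MatchesVsFind, the helper findAll, and the sample strings are assumptions made for this example, not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Hypothetical helper: calls find() repeatedly and records each subsequence it reports.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String regex = "\\d+";
        String input = "a1b22";

        // Both whole-input checks agree: the letters prevent a full match.
        System.out.println(input.matches(regex));          // false
        System.out.println(Pattern.matches(regex, input)); // false

        // find() walks through the input and returns each matching subsequence in turn.
        System.out.println(findAll(regex, input));         // [1, 22]

        // A string made only of digits satisfies the whole-input check.
        System.out.println("122".matches(regex));          // true
    }
}
```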

Thus the problem is with the regex. Try the following.

String test = "User Comments: This is \t a\ta \ntest\n\n message \n";

String pattern1 = "User Comments: [\\s\\S]*^test$[\\s\\S]*";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
System.out.println(p.matcher(test).find());  //true

String pattern2 = "(?m)User Comments: [\\s\\S]*^test$[\\s\\S]*";
System.out.println(test.matches(pattern2));  //true

Thus in short, your first regex does not match the whole string as you'd expect. (\\W)* matches the empty string, because * means zero or more occurrences and the next character, T, is a word character; (\\S)* then matches only This, stopping at the first whitespace. find() is satisfied with that partial match, so it returns true. The second one fails because matches() has to match the whole string, and it can't: \\W matches only a non-word character, i.e. [^a-zA-Z0-9_], so it cannot consume T, and \\S cannot get past the space after This, which leaves the rest of the string unmatched.
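One way to check what the original pattern really consumed is to print the matched text instead of only the boolean. The following is a sketch; the class WhatMatched and the helper matchedText are hypothetical names, while the test string and regex are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhatMatched {
    // Hypothetical helper: returns the text consumed by the first find(), or null if nothing matched.
    static String matchedText(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String test = "User Comments: This is \t a\ta \n test \n\n message \n";
        String regex = "User Comments: (\\W)*(\\S)*";

        // Brackets make the boundaries of the match visible.
        System.out.println("[" + matchedText(regex, test) + "]"); // [User Comments: This]

        // The same regex cannot cover the whole string, so matches() reports false.
        System.out.println(test.matches(regex)); // false
    }
}
```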

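The multiline flag tells the regex to match the pattern against each line rather than the entire string; for your purposes, a wildcard will suffice.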
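A possible reading of the wildcard suggestion, sketched with String.matches. The class WildcardDemo and the method isUserComment are illustrative names, and the use of the inline (?s) flag is an assumption about what the answer intends: without it the dot wildcard would stop at the first newline.

```java
public class WildcardDemo {
    // Hypothetical helper: true if s starts with "User Comments:" followed by anything, newlines included.
    static boolean isUserComment(String s) {
        return s.matches("(?s)User Comments:.*");
    }

    public static void main(String[] args) {
        String test = "User Comments: This is \t a\ta \n test \n\n message \n";

        // With (?s) the trailing .* swallows the rest of the input, so the whole string matches.
        System.out.println(isUserComment(test)); // true

        // Without (?s) the dot cannot cross the first newline and the whole-string match fails.
        System.out.println(test.matches("User Comments:.*")); // false
    }
}
```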
