Negative lookahead regex not working in Java
The following regex works when I test it in an online demo, but when I use it in my Java code it doesn't return a match. It uses a negative lookahead to ensure that no blank line (two consecutive newlines) occurs between MAIN LEVEL and Bedrooms. Why won't it work in Java?
Regex:
^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)
Java:
pattern = Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if(match.find())
{
//Doesn't reach here
String bed = match.group(1);
bed = bed.trim();
}
content is just a string read from a text file, which contains the exact text shown in the demo linked above.
File file = new File("C:\\Users\\ME\\Desktop\\content.txt");
content = new Scanner(file).useDelimiter("\\Z").next();
UPDATE:
I changed my code to include the multiline modifier (?m), but it still doesn't match, and the code prints "null".
pattern = Pattern.compile("(?m)^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if(match.find())
{ // Still not reaching here
mainBeds=match.group(1);
mainBeds= mainBeds.trim();
}
System.out.println(mainBeds); // Prints null
As explained in Alan Moore's answer, there is a mismatch between the line separators used in your file (\r\n) and the one your pattern specifies (\n):
Original code:
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
Note: I explain what \r and \n represent, and the difference between \r\n and \n, in the second item of the "side notes" section.
Most/all Java versions:
You can use \r?\n to match both formats, and this is sufficient in most cases.
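As a minimal sketch of this first method, here is the question's pattern with every \n replaced by \r?\n. The class name, the method name and the sample text are made up for illustration (the real file content isn't shown in the post), and (?m) is kept from the question's update.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrLfTolerantMatch {
    // The question's pattern, with each \n replaced by \r?\n so that both
    // Windows (\r\n) and UNIX (\n) line separators are accepted.
    static final Pattern BEDROOMS = Pattern.compile(
            "(?m)^\\s*\\bMAIN LEVEL\\b\\r?\\n(?:(?!\\r?\\n\\r?\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");

    // Returns the trimmed text after "Bedrooms:", or null when nothing matches.
    static String findBedrooms(String content) {
        Matcher m = BEDROOMS.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the file from the question.
        String windows = "MAIN LEVEL\r\nKitchen: 1\r\nBedrooms: 3\r\n\r\nUPPER LEVEL\r\nBedrooms: 2\r\n";
        String unix = windows.replace("\r\n", "\n");

        System.out.println(findBedrooms(windows)); // prints 3
        System.out.println(findBedrooms(unix));    // prints 3
    }
}
```

The lookahead still stops at the first blank line, so a Bedrooms entry that belongs to a later block is not picked up.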
Most/all Java versions:
You can use \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029] to match "Any Unicode linebreak sequence".
Java 8 and later:
You can use the Linebreak Matcher (\R). It is equivalent to the second method (above), and whenever possible (Java 8 or later), this is the recommended method.
Resulting code (3rd method):
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\R(?:(?!\\R\\R)[\\s\\S])*\\bBedrooms:\\s*(.*)");
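Here is a rough, runnable sketch of the \R approach (Java 8 or later). The class name, the method name and the sample text are hypothetical. One deviation from the code above, based on an assumption you should verify on your JDK: as far as I know, on Java 9 and later a bare \R can backtrack and match only the \r of a \r\n pair, so \R\R may count a single Windows line break as two. Wrapping each \R in an atomic group, (?>\R), is meant to prevent that, and it doesn't change anything on Java 8.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinebreakMatcherDemo {
    // One line break, matched atomically so that \r\n is always taken as a unit.
    // (The atomic group is a precaution, not part of the original answer.)
    private static final String LB = "(?>\\R)";

    static final Pattern BEDROOMS = Pattern.compile(
            "(?m)^\\s*\\bMAIN LEVEL\\b" + LB
            + "(?:(?!" + LB + LB + ")[\\s\\S])*"   // stop at the first blank line
            + "\\bBedrooms:\\s*(.*)");

    // Returns the trimmed text after "Bedrooms:", or null when nothing matches.
    static String findBedrooms(String content) {
        Matcher m = BEDROOMS.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in three line-separator flavours.
        String crlf = "MAIN LEVEL\r\nKitchen: 1\r\nBedrooms: 3\r\n\r\nUPPER LEVEL\r\nBedrooms: 2\r\n";
        String lf = crlf.replace("\r\n", "\n");
        String cr = crlf.replace("\r\n", "\r");

        System.out.println(findBedrooms(crlf)); // prints 3
        System.out.println(findBedrooms(lf));   // prints 3
        System.out.println(findBedrooms(cr));   // prints 3
    }
}
```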
Side notes:
You can replace \\R\\R with \\R{2}, which is more readable.
Different line-break formats exist and are used in different systems because early OSs inherited the "line-break logic" from mechanical typing machines, like typewriters.
The \r in code represents a Carriage Return, aka CR. The idea behind it is to return the typing cursor to the start of the line.
The \n in code represents a Line Feed, aka LF. The idea behind it is to move the typing cursor to the next line.
The most common line-break formats are CR-LF (\r\n), used primarily by Windows, and LF (\n), used by most UNIX-like systems. This is the reason why "\r?\n will be sufficient in most cases", and you can reliably use it for systems intended for household-grade users.
However, some (rare) OSs, usually in industrial-grade systems such as servers, may use CR, LF-CR, or something else entirely, which is why the second method has so many characters in it. So if you need the code to be compatible with every system, you will need the second or, preferably, the third method.
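If you are not sure which format your file actually uses, you can count the separators in the string you read. This is only a quick sketch: the class and method names are made up, and it only distinguishes CR-LF, lone CR and lone LF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBreakInspector {
    // \r\n is listed first so that a CR-LF pair is counted as one separator.
    private static final Pattern BREAK = Pattern.compile("\r\n|\r|\n");

    // Counts how many separators of each kind appear in the content.
    static String describe(String content) {
        int crlf = 0, cr = 0, lf = 0;
        Matcher m = BREAK.matcher(content);
        while (m.find()) {
            switch (m.group()) {
                case "\r\n": crlf++; break;
                case "\r":   cr++;   break;
                default:     lf++;   break;
            }
        }
        return "CRLF=" + crlf + " CR=" + cr + " LF=" + lf;
    }

    public static void main(String[] args) {
        // Hypothetical mixed input: one separator of each kind.
        System.out.println(describe("a\r\nb\nc\rd")); // prints CRLF=1 CR=1 LF=1
    }
}
```

A non-zero CRLF count on the file from the question would confirm that a pattern written with a bare \n cannot match across its lines.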
Here is a useful method for testing where your patterns are failing:
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String content = "..."; // Replace "..." with your content.
String patternString = "..."; // Replace "..." with your pattern.
String lastPatternSuccess = "None. You suck at Regex!";
// Try every prefix of the pattern and remember the longest one that still matches.
for (int i = 0; i <= patternString.length(); i++) {
    try {
        String patternSubstring = patternString.substring(0, i);
        Pattern pattern = Pattern.compile(patternSubstring);
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            lastPatternSuccess = i + " - Pattern: " + patternSubstring + " - Match: \n" + matcher.group();
        }
    } catch (Exception ex) {
        // This prefix is not a valid pattern on its own (e.g. an unclosed group); skip it.
    }
}
System.out.println(lastPatternSuccess);
It's the line separators. You're looking for \n, but your file actually uses \r\n. If you're running Java 8, you can change every \\n in your code to \\R (the universal line separator). For Java 7 or earlier, use \\r?\\n.