Negative lookahead regex not working in Java

The following regex works when I test it in the linked demo, but when I use it in my Java code it does not return a match. It uses a negative lookahead to ensure that no blank line (two consecutive newlines) occurs between MAIN LEVEL and Bedrooms. Why won't it work in Java?

Regex

^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)

Java

pattern = Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if (match.find())
{
    // Doesn't reach here
    String bed = match.group(1);
    bed = bed.trim();
}

content is just a string read from a text file, which contains the exact text shown in the demo linked above.

File file = new File("C:\\Users\\ME\\Desktop\\content.txt"); 
 content = new Scanner(file).useDelimiter("\\Z").next();

UPDATE:

I changed my code to include the multiline modifier (?m), but it still does not match, and the program prints "null".

pattern = Pattern.compile("(?m)^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");
match = pattern.matcher(content);
if (match.find())
{   // Still not reaching here
    mainBeds = match.group(1);
    mainBeds = mainBeds.trim();
}
System.out.println(mainBeds);     // Prints null

The problem:

As explained in Alan Moore's answer, the line separators used in your file (\r\n) do not match what your pattern specifies (\n):

Original code:
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");

Note: The second item of the "Side notes" section explains what \r and \n represent, and the difference between \r\n and \n.
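
The mismatch can be reproduced without the original file. The sketch below is only an illustration: the sample listing text, the class name and the helper `firstBedrooms` are assumptions (the real file content is not shown in the question), while the pattern is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorMismatchDemo {
    // The pattern from the question; it only recognises a bare \n as a line break.
    static final String QUESTION_REGEX =
            "^\\s*\\bMAIN LEVEL\\b\\n(?:(?!\\n\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)";

    // Hypothetical helper: returns the trimmed capture group, or null when nothing matches.
    static String firstBedrooms(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample text with UNIX-style (LF) line endings.
        String lf = "MAIN LEVEL\nEntrance: Front\nBedrooms: 3\n\nUPPER LEVEL\nBedrooms: 2\n";
        // The same text with Windows-style (CR-LF) line endings.
        String crlf = lf.replace("\n", "\r\n");

        System.out.println(firstBedrooms(QUESTION_REGEX, lf));   // prints 3
        System.out.println(firstBedrooms(QUESTION_REGEX, crlf)); // prints null
    }
}
```

With the CR-LF text, `MAIN LEVEL` is followed by `\r` rather than `\n`, so `find()` fails before the lookahead is ever evaluated.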


The solution(s):

  1. Most/all Java versions:
    You can use \r?\n to match both formats, and this is sufficient in most cases.

  2. Most/all Java versions:
    You can use \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029] to match "Any Unicode linebreak sequence". In Java source, write each backslash twice (\\u000D and so on); otherwise the compiler translates the Unicode escapes before the regex engine ever sees them.

  3. Java 8 and later:
    You can use the Linebreak Matcher (\R). It is equivalent to the second method (above), and whenever possible (Java 8 or later), this is the recommended method.

Resulting code (3rd method):
Pattern.compile("^\\s*\\bMAIN LEVEL\\b\\R(?:(?!\\R\\R)[\\s\\S])*\\bBedrooms:\\s*(.*)");
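
For comparison, here is a minimal sketch of the first method (`\r?\n`), which does not depend on the Java version. The sample text, the class name and the helper `mainLevelBedrooms` are made up for illustration; only the pattern is derived from the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TolerantSeparatorDemo {
    // First method: every \n of the question's pattern becomes \r?\n.
    static final Pattern MAIN_BEDROOMS = Pattern.compile(
            "^\\s*\\bMAIN LEVEL\\b\\r?\\n(?:(?!\\r?\\n\\r?\\n)[\\s\\S])*\\bBedrooms:\\s*(.*)");

    // Hypothetical helper: returns the trimmed capture group, or null when nothing matches.
    static String mainLevelBedrooms(String content) {
        Matcher m = MAIN_BEDROOMS.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample text, once with LF and once with CR-LF line endings.
        String lf = "MAIN LEVEL\nEntrance: Front\nBedrooms: 3\n\nUPPER LEVEL\nBedrooms: 2\n";
        String crlf = lf.replace("\n", "\r\n");
        // A MAIN LEVEL block without a Bedrooms line; the blank line must stop the search.
        String noBedrooms = "MAIN LEVEL\r\nKitchen: 1\r\n\r\nUPPER LEVEL\r\nBedrooms: 2\r\n";

        System.out.println(mainLevelBedrooms(lf));         // prints 3
        System.out.println(mainLevelBedrooms(crlf));       // prints 3
        System.out.println(mainLevelBedrooms(noBedrooms)); // prints null
    }
}
```

The third call shows that the negative lookahead still does its job: the search stops at the blank line instead of picking up the Bedrooms value of the next block.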


Side notes:

  1. You can replace \\R\\R with \\R{2}, which is more readable.

  2. Different line-break formats exist and are used in different systems because early OSs inherited their "line-break logic" from mechanical typing machines such as typewriters.

    The \r in code represents a Carriage-Return, aka CR. The idea behind it is to return the typing cursor to the start of the line.

    The \n in code represents a Line-Feed, aka LF. The idea behind it is to move the typing cursor to the next line.

    The most common line-break formats are CR-LF (\r\n), used primarily by Windows, and LF (\n), used by most UNIX-like systems. This is why "\r?\n will be sufficient in most cases", and you can reliably use it for systems aimed at home users.

    However, some (rare) OSs, usually on industrial-grade equipment such as servers, may use CR, LF-CR, or something else entirely, which is why the second method has so many characters in it. If you need the code to be compatible with every system, you will need the second or, preferably, the third method.

  3. Here is a useful method for finding where your patterns fail:

     String content = "..."; //Replace "..." with your content.
     String patternString = "..."; //Replace "..." with your pattern.
     String lastPatternSuccess = "None. You suck at Regex!";
     for (int i = 0; i <= patternString.length(); i++) {
         try {
             String patternSubstring = patternString.substring(0, i);
             Pattern pattern = Pattern.compile(patternSubstring);
             Matcher matcher = pattern.matcher(content);
             if (matcher.find()) {
                 lastPatternSuccess = i + " - Pattern: " + patternSubstring + " - Match: \n" + matcher.group();
             }
         } catch (Exception ex) {
             //Ignore and jump to next
         }
     }
     System.out.println(lastPatternSuccess);
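
To check which line-break format a string actually contains (CR-LF or LF, as described in the second side note), it can help to make the invisible characters visible before debugging the pattern. This is a small sketch; the class and method names are hypothetical and not part of either answer.

```java
class LineBreakInspector {
    // Replaces each CR and LF character with the visible two-character text \r or \n.
    static String showLineBreaks(String text) {
        return text.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        // Made-up sample mixing a CR-LF and a bare LF line ending.
        System.out.println(showLineBreaks("MAIN LEVEL\r\nBedrooms: 3\n"));
        // prints MAIN LEVEL\r\nBedrooms: 3\n
    }
}
```

Printing `showLineBreaks(content)` for the string read from the file shows immediately whether the lines end in `\r\n` or `\n`.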

It's the line separators. You're looking for \n, but your file actually uses \r\n. If you're running Java 8, you can change every \\n in your code to \\R (the universal line separator). For Java 7 or earlier, use \\r?\\n.
