
Parse line of text after multiline regex pattern

I am attempting to parse fields from a PDF file converted to text via pdfbox. Here is an example of a field I need to extract: "BUYER NAME AND ADDRESS:". These documents often contain translations, so the colon ":" appears a variable number of characters after BUYER NAME AND ADDRESS. Example below.

Txt file..
BUYER NAME AND ADDRESS / NOMBRE Y
DIRECCIÓN DEL COMPRADOR:
Name of buyer here
Txt continues..

Here is my attempted pattern / scanning code.

Scanner sc = new Scanner(txtFile);
Pattern p = Pattern.compile("BUYER NAME AND ADDRESS.*:", Pattern.MULTILINE);
sc.findWithinHorizon(p, 0);
String buyer = sc.nextLine();
buyer = sc.nextLine();
System.out.println("Buyer Name: "+buyer);

This works when the text file is English only, e.g. "BUYER NAME AND ADDRESS:", but if there are additional characters or line breaks before the colon, it fails. How can I fix the pattern?

The given regex "BUYER NAME AND ADDRESS.*:" matches "BUYER NAME AND ADDRESS", followed by any number of characters, followed by a colon. Because quantifiers are greedy by default, .* runs to the last colon it can reach; use .*? (non-greedy) to stop at the first one. Additionally, you need to change MULTILINE ( ^ and $ match at the start and end of each line) to DOTALL ( . also matches line breaks) to make this work, as @stribizhev said. Without DOTALL, . stops at the end of the line, so the pattern can never reach a colon that sits on a later line.
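To make the DOTALL plus non-greedy fix concrete, here is a minimal sketch built around the Scanner code from the question. The class name BuyerDotall, the helper buyerAfterLabel, and the inline sample text are made up for illustration; the real input would come from the converted file.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class BuyerDotall {
    // Label, then as few characters as possible (line breaks included,
    // thanks to DOTALL) up to and including the first colon.
    static final Pattern LABEL =
            Pattern.compile("BUYER NAME AND ADDRESS.*?:", Pattern.DOTALL);

    // Hypothetical helper: returns the line that follows the label line,
    // or null when the label or the following line is missing.
    static String buyerAfterLabel(String text) {
        try (Scanner sc = new Scanner(text)) {
            if (sc.findWithinHorizon(LABEL, 0) == null) {
                return null; // label not found
            }
            if (!sc.hasNextLine()) {
                return null; // input ends right after the colon
            }
            sc.nextLine(); // discard the rest of the line that holds the colon
            return sc.hasNextLine() ? sc.nextLine() : null;
        }
    }

    public static void main(String[] args) {
        // Sample text from the question; \u00D3 is the accented O.
        String text = "BUYER NAME AND ADDRESS / NOMBRE Y\n"
                + "DIRECCI\u00D3N DEL COMPRADOR:\n"
                + "Name of buyer here\n"
                + "Txt continues..\n";
        System.out.println("Buyer Name: " + buyerAfterLabel(text));
        // prints "Buyer Name: Name of buyer here"
    }
}
```

The same helper still handles the English-only case, because .*? is allowed to match zero characters before the colon.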

This can also be fixed by using [^:] ; a character class of the form [^...] matches any character that is not listed, line breaks included. With this you don't need any modifiers (I removed the : at the end because you probably don't need it if you do it like this):

"BUYER NAME AND ADDRESS[^:]*"
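As a quick check that the negated class crosses line breaks without any flags, here is a small sketch that prints what the pattern actually matches. The class name BuyerNegatedClass and the helper matchLabel are illustrative names, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuyerNegatedClass {
    // [^:] matches any character except a colon, line breaks included,
    // so neither DOTALL nor MULTILINE is needed.
    static final Pattern LABEL = Pattern.compile("BUYER NAME AND ADDRESS[^:]*");

    // Hypothetical helper: returns the matched label text, or null if absent.
    static String matchLabel(String text) {
        Matcher m = LABEL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Sample text from the question; \u00D3 is the accented O.
        String text = "BUYER NAME AND ADDRESS / NOMBRE Y\n"
                + "DIRECCI\u00D3N DEL COMPRADOR:\n"
                + "Name of buyer here\n";
        // The match spans both label lines and stops just before the colon.
        System.out.println(matchLabel(text));
    }
}
```

Used with findWithinHorizon in the question's code, the scanner is left just before the colon, so the first nextLine() returns ":" and the second returns the buyer line. One caveat: if the text after the label contains no colon at all, [^:]* runs to the end of the input.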
