Regex to match a string pattern across multiple lines

Question

I am trying to extract text from a PDF, but the extracted text is not in order, so I am writing a regex to pull out the part I need. I am new to writing regex and to handling multi-line text, and I am running into problems. The string looks like this: stringtext = 0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60 (Line 1) A (Line 2) / (Line 3) B (Line 4) / (Line 5) C (Line 6) / (Line 7) D (Line 8) / (Line 9)

My aim is to extract only ABCD from the string. Could someone help? Thanks!

I tried researching, but I could not find a solution that fits my case.

    String stringtext = "0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60\n"
                 + "                 A\n"
                 + "                 /\n"
                 + "                 B\n"
                 + "                 /\n"
                 + "                 C\n"
                 + "                 /\n"
                 + "                 D\n"
                 + "                 /;";
    // My attempt so far
    Pattern pattern = Pattern.compile(".*\\r\\n(\\_.*)$");
    Matcher matcher = pattern.matcher(stringtext);
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }

Expected output: ABCD

Answer 1

If you want to use .* to match the first line, you can make the match more specific by starting it with, for example, the pattern of the first number.

You could use the \G anchor to get repeated matches, and capture each uppercase character in a capturing group.

(?:^\d+,\d+.*|\G(?!^))\R\h+([A-Z])\R.*\/

Explanation

(?: Non capturing group
- ^\d+,\d+.* Match from the start of the string 1+ digits, a comma and 1+ digits, then the rest of the line
- | Or
- \G(?!^) Assert the position at the end of the previous match, not at the start of the string
) Close non capturing group
\R\h+ Match a Unicode newline sequence and 1+ horizontal whitespace characters
([A-Z]) Capture an uppercase char in group 1
\R.*\/ Match a Unicode newline sequence, any char except a newline 0+ times, and a forward slash
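
To see what the \G(?!^) alternative contributes, here is a small sketch that runs the pattern above and a variant without the anchoring group on the same input. The class name ChainedMatchDemo, the helper captures, and the sample text (with a stray "note" line breaking the letter/slash sequence) are made up for illustration and are not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChainedMatchDemo {
    // Collects group 1 of every match of regex in text (hypothetical helper).
    static List<String> captures(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Illustrative input: the "note" line interrupts the letter/slash pairs.
        String text = "0,10 - 0,20\n"
                + "  A\n"
                + "  /\n"
                + "note\n"
                + "  B\n"
                + "  /";

        // Pattern from the answer: each match must start at the number line
        // or exactly where the previous match ended.
        String chained = "(?:^\\d+,\\d+.*|\\G(?!^))\\R\\h+([A-Z])\\R.*\\/";
        // Same tail without the anchoring group: matches anywhere in the text.
        String unanchored = "\\R\\h+([A-Z])\\R.*\\/";

        System.out.println(captures(chained, text));    // prints [A]
        System.out.println(captures(unanchored, text)); // prints [A, B]
    }
}
```

With \G, the chain stops at the first line that does not fit, so only letters in an unbroken run directly after the number line are captured. Without it, any indented uppercase letter followed by a slash line would match.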

Regex demo | Java demo

For example:

String regex = "(?:^\\d+,\\d+.*|\\G(?!^))\\R\\h+([A-Z])\\R.*\\/";
String stringtext = "0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60\n"
     + "                     A\n"
     + "                     /\n"
     + "                     B\n"
     + "                     /\n"
     + "                     C\n"
     + "                     /\n"
     + "                     D\n"
     + "                     /;";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(stringtext);

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Result

A
B
C
D
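
The question asks for the single string ABCD rather than one letter per line. A minimal sketch of that last step, assuming the same pattern as above, is to append each captured group to a StringBuilder. The class name LetterExtractor and the method extractLetters are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterExtractor {
    // Same pattern as in the answer, compiled once and reused.
    private static final Pattern LETTER_LINE =
            Pattern.compile("(?:^\\d+,\\d+.*|\\G(?!^))\\R\\h+([A-Z])\\R.*\\/");

    // Concatenates every captured uppercase letter into one string.
    static String extractLetters(String text) {
        StringBuilder letters = new StringBuilder();
        Matcher matcher = LETTER_LINE.matcher(text);
        while (matcher.find()) {
            letters.append(matcher.group(1));
        }
        return letters.toString();
    }

    public static void main(String[] args) {
        String stringtext = "0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60\n"
                + "                     A\n"
                + "                     /\n"
                + "                     B\n"
                + "                     /\n"
                + "                     C\n"
                + "                     /\n"
                + "                     D\n"
                + "                     /;";

        System.out.println(extractLetters(stringtext)); // prints ABCD
    }
}
```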
