
Regex to match string pattern in multi lines

I am trying to extract text from a PDF, but the extracted text is not in order, so I am writing a regex to pull out the part I need. I am new to regex and to handling multi-line text, and I am running into issues. The string looks like this: stringtext = 0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60 (line 1) A (line 2) / (line 3) B (line 4) / (line 5) C (line 6) / (line 7) D (line 8) / (line 9)

My aim is to extract only ABCD from the string. Could someone help? Thanks!

I tried researching, but I could not find a solution that fits this case.

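    String stringtext = "0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60\n"
                      + "                 A\n"
                      + "                 /\n"
                      + "                 B\n"
                      + "                 /\n"
                      + "                 C\n"
                      + "                 /\n"
                      + "                 D\n"
                      + "                 /;";
    // My attempt so far; it does not print the expected letters
    Pattern pattern = Pattern.compile(".*\\r\\n(\\_.*)$");
    Matcher matcher = pattern.matcher(stringtext);
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }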

Expected output should be ABCD

If you want to use .* to match the first line, you can make the match more specific by starting it with the pattern of the first number, for example \d+,\d+.

You can then use the \G anchor to get repeated, contiguous matches, and capture each uppercase character in a capturing group.

(?:^\d+,\d+.*|\G(?!^))\R\h+([A-Z])\R.*\/

Explanation

  • (?: Non-capturing group
    • ^\d+,\d+.* From the start of the string, match 1+ digits, a comma and 1+ digits, then the rest of the line
    • | Or
    • \G(?!^) Assert the position at the end of the previous match, but not at the start of the string
  • ) Close the non-capturing group
  • \R\h+ Match a Unicode newline sequence and 1+ horizontal whitespace characters
  • ([A-Z]) Capture an uppercase character in group 1
  • \R.*\/ Match a Unicode newline sequence, any character except a newline 0+ times, and a forward slash
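To see what the \G anchor contributes, here is a small sketch with the same pattern run against a shortened, made-up input in which an unrelated line interrupts the letter/slash pairs. The class name `ChainedMatchDemo`, the helper `letters`, and the sample text are assumptions for illustration only. Because every match after the first must begin exactly where the previous one ended, the chain should stop at the interruption and the later pair should be ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChainedMatchDemo {
    // Same pattern as in the answer: start at the numeric first line,
    // then continue only from the end of the previous match (\G).
    private static final Pattern CHAINED =
            Pattern.compile("(?:^\\d+,\\d+.*|\\G(?!^))\\R\\h+([A-Z])\\R.*\\/");

    // Hypothetical helper: collects every group-1 capture in order.
    static List<String> letters(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = CHAINED.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up input: the "unrelated line" breaks the chain after B.
        String text = "0,10 - 0,20\n"
                + "  A\n"
                + "  /\n"
                + "  B\n"
                + "  /\n"
                + "unrelated line\n"
                + "  C\n"
                + "  /\n";
        System.out.println(letters(text)); // prints [A, B]
    }
}
```

The pair containing C is skipped because neither alternative can start there: ^ only matches at the very beginning of the input (no MULTILINE flag is set), and \G only matches at the end of the previous match.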


For example:

String regex = "(?:^\\d+,\\d+.*|\\G(?!^))\\R\\h+([A-Z])\\R.*\\/";
String stringtext = "0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60\n"
     + "                     A\n"
     + "                     /\n"
     + "                     B\n"
     + "                     /\n"
     + "                     C\n"
     + "                     /\n"
     + "                     D\n"
     + "                     /;";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(stringtext);

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Result

A
B
C
D
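The question asks for the single string ABCD rather than one letter per line. A minimal sketch of one way to get that, assuming the same regex and input as above, is to append each captured group to a StringBuilder. The class name `JoinLetters` and the method `joinLetters` are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JoinLetters {
    // Hypothetical helper: concatenates all group-1 captures into one string.
    static String joinLetters(String text) {
        Pattern pattern =
                Pattern.compile("(?:^\\d+,\\d+.*|\\G(?!^))\\R\\h+([A-Z])\\R.*\\/");
        Matcher matcher = pattern.matcher(text);
        StringBuilder joined = new StringBuilder();
        while (matcher.find()) {
            joined.append(matcher.group(1));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        String stringtext = "0,10 - 0,20 0,30 - 0,40, 0,50 - 0,60\n"
                + "                     A\n"
                + "                     /\n"
                + "                     B\n"
                + "                     /\n"
                + "                     C\n"
                + "                     /\n"
                + "                     D\n"
                + "                     /;";
        System.out.println(joinLetters(stringtext)); // prints ABCD
    }
}
```

If nothing matches, the helper returns an empty string, so callers may want to check for that case before using the result.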
