
Regex Help for text/pattern between two keywords

I am trying to extract the text between two keywords. The pattern below repeats throughout the document, with different content between the start keyword and the end keyword each time. The document also contains paragraphs and other text before and after these blocks, which I don't want to extract. Can anyone help me with a regex that extracts all occurrences?

Start keyword: RIASWIX
End keyword: Sky Access

----Document Start-------

Paragraph*

RIASWIX.*                                 ABCDEF1   NONE
   WORKING:  HELLO(READ)
   BOOLEAN Access:  SADGRE3, VJFKES3, JGJKEWW, IS4DWF44(A), DFEAWE2(G),
     DW4444W, IHFK3MF3
   BAZAAR Access:  No resource with BAZAAR Access
   GHAR Access:  No resource with GHAR Access
   WATER Access:  ADMINDDD(A), GEDDE33
   SKY None:  No Resource with Sky Access

RIASWIX.@7483NFJ.*                                 HFDFDF3   NONE
   WORKING:  BYE(READ)
   BOOLEAN Access:  GRREGGG, GREFEFF, GFGGGG, FDFDFDF(A), RERERE3(G),
     GFFWEF44, FFRF44F
   BAZAAR Access:  No resource with BAZAAR Access
   GHAR Access:  No resource with GHAR Access
   WATER Access:  ADMINEWW(A), FFRFRGR
   SKY None:  No Resource with Sky Access

RIASWIX.@7483KXX.*                                 HFDFDF3   NONE
   WORKING:  TATA(READ)
   BOOLEAN Access:  GRDSD33, FASDE, GFGGGG, RWERW33(A), NMUYHT4(G),
   BAZAAR Access:  XCDFEFE3, FREFE33R
   GHAR Access:  No resource with GHAR Access
   WATER Access:  DASDEFG(A), SJMFEIOE(P)
   SKY None:  No Resource with Sky Access

*Text

----Document End-------

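Use the (?s) (DOTALL) flag so that . also matches newline characters; see the related question regex-match-all-characters-between-two-strings. In Python 3.11 and later an inline global flag such as (?s) must be placed at the very start of the pattern, so the flag is passed as an argument here instead.

import re

# str1 holds the full text of the document
print(re.findall(r'RIASWIX(.*?)Sky Access', str1, flags=re.DOTALL))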
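As a minimal, self-contained sketch of this approach, the snippet below runs the same pattern on a shortened stand-in for the document. The names sample and extract_blocks are hypothetical and chosen only for illustration; the sample keeps just enough lines to show the shape of the records.

```python
import re

# Shortened, hypothetical stand-in for the document; only the shape matters.
sample = (
    "Paragraph*\n"
    "\n"
    "RIASWIX.*                ABCDEF1   NONE\n"
    "   WORKING:  HELLO(READ)\n"
    "   SKY None:  No Resource with Sky Access\n"
    "\n"
    "RIASWIX.@7483NFJ.*       HFDFDF3   NONE\n"
    "   WORKING:  BYE(READ)\n"
    "   SKY None:  No Resource with Sky Access\n"
    "\n"
    "*Text\n"
)


def extract_blocks(text):
    """Return the text between each 'RIASWIX' and the next 'Sky Access'."""
    # re.DOTALL lets '.' cross line breaks; '.*?' stops at the nearest end keyword.
    return re.findall(r"RIASWIX(.*?)Sky Access", text, flags=re.DOTALL)


blocks = extract_blocks(sample)
for block in blocks:
    print(repr(block))
```

Because the keywords sit outside the capture group, each returned string excludes both RIASWIX and Sky Access. To keep them, drop the parentheses so that findall returns the whole match.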

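Alternative regex (written as a Java string literal, hence the doubled backslashes; it needs the MULTILINE and DOTALL flags):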

"^RIASWIX.*?\\bSky Access\\b"

Regex in context and testbench:

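// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static void main(String[] args) {
    String input = getInput();

    // MULTILINE lets ^ match at the start of every line;
    // DOTALL lets . match line terminators, so one match can span several lines.
    Matcher matcher = Pattern
            .compile("^RIASWIX.*?\\bSky Access\\b", Pattern.MULTILINE | Pattern.DOTALL)
            .matcher(input);

    // Print every block, keywords included
    while (matcher.find()) {
        System.out.println("=== === === START === ==== ===");
        System.out.println(matcher.group());
        System.out.println("=== === === END === ==== ===\n");
    }
}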

The input, taken from the document:

private static String getInput() {
    return "----Document Start-------\n" +
            "\n" +
            "Paragraph*\n" +
            "\n" +
            "RIASWIX.*                                 ABCDEF1   NONE\n" +
            "   WORKING:  HELLO(READ)\n" +
            "   BOOLEAN Access:  SADGRE3, VJFKES3, JGJKEWW, IS4DWF44(A), DFEAWE2(G),\n" +
            "     DW4444W, IHFK3MF3\n" +
            "   BAZAAR Access:  No resource with BAZAAR Access\n" +
            "   GHAR Access:  No resource with GHAR Access\n" +
            "   WATER Access:  ADMINDDD(A), GEDDE33\n" +
            "   SKY None:  No Resource with Sky Access\n" +
            "\n" +
            "RIASWIX.@7483NFJ.*                                 HFDFDF3   NONE\n" +
            "   WORKING:  BYE(READ)\n" +
            "   BOOLEAN Access:  GRREGGG, GREFEFF, GFGGGG, FDFDFDF(A), RERERE3(G),\n" +
            "     GFFWEF44, FFRF44F\n" +
            "   BAZAAR Access:  No resource with BAZAAR Access\n" +
            "   GHAR Access:  No resource with GHAR Access\n" +
            "   WATER Access:  ADMINEWW(A), FFRFRGR\n" +
            "   SKY None:  No Resource with Sky Access\n" +
            "\n" +
            "RIASWIX.@7483KXX.*                                 HFDFDF3   NONE\n" +
            "   WORKING:  TATA(READ)\n" +
            "   BOOLEAN Access:  GRDSD33, FASDE, GFGGGG, RWERW33(A), NMUYHT4(G),\n" +
            "   BAZAAR Access:  XCDFEFE3, FREFE33R\n" +
            "   GHAR Access:  No resource with GHAR Access\n" +
            "   WATER Access:  DASDEFG(A), SJMFEIOE(P)\n" +
            "   SKY None:  No Resource with Sky Access\n" +
            "\n" +
            "*Text\n" +
            "\n" +
            "----Document End-------";
}

Output:

=== === === START === ==== ===
RIASWIX.*                                 ABCDEF1   NONE
   WORKING:  HELLO(READ)
   BOOLEAN Access:  SADGRE3, VJFKES3, JGJKEWW, IS4DWF44(A), DFEAWE2(G),
     DW4444W, IHFK3MF3
   BAZAAR Access:  No resource with BAZAAR Access
   GHAR Access:  No resource with GHAR Access
   WATER Access:  ADMINDDD(A), GEDDE33
   SKY None:  No Resource with Sky Access
=== === === END === ==== ===

=== === === START === ==== ===
RIASWIX.@7483NFJ.*                                 HFDFDF3   NONE
   WORKING:  BYE(READ)
   BOOLEAN Access:  GRREGGG, GREFEFF, GFGGGG, FDFDFDF(A), RERERE3(G),
     GFFWEF44, FFRF44F
   BAZAAR Access:  No resource with BAZAAR Access
   GHAR Access:  No resource with GHAR Access
   WATER Access:  ADMINEWW(A), FFRFRGR
   SKY None:  No Resource with Sky Access
=== === === END === ==== ===

=== === === START === ==== ===
RIASWIX.@7483KXX.*                                 HFDFDF3   NONE
   WORKING:  TATA(READ)
   BOOLEAN Access:  GRDSD33, FASDE, GFGGGG, RWERW33(A), NMUYHT4(G),
   BAZAAR Access:  XCDFEFE3, FREFE33R
   GHAR Access:  No resource with GHAR Access
   WATER Access:  DASDEFG(A), SJMFEIOE(P)
   SKY None:  No Resource with Sky Access
=== === === END === ==== ===
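
The effect of the ^ anchor in the pattern ^RIASWIX.*?\bSky Access\b can be sketched in Python. This is an illustration under an assumption: the hypothetical input below mentions the start keyword in the middle of a prose line, which the original document does not show. In a Python raw string the backslashes are single rather than doubled.

```python
import re

# Hypothetical text: 'RIASWIX' is also mentioned inside a prose line.
text = (
    "See the RIASWIX entries below.\n"
    "RIASWIX.*      ABCDEF1   NONE\n"
    "   SKY None:  No Resource with Sky Access\n"
)

# MULTILINE makes '^' match at the start of every line; DOTALL lets '.' span lines.
anchored = re.compile(r"^RIASWIX.*?\bSky Access\b", re.MULTILINE | re.DOTALL)
unanchored = re.compile(r"RIASWIX.*?\bSky Access\b", re.DOTALL)

anchored_match = anchored.search(text).group()
unanchored_match = unanchored.search(text).group()

# First line of each match shows where the match begins.
print(anchored_match.splitlines()[0])
print(unanchored_match.splitlines()[0])
```

With the anchor, the match can only begin where RIASWIX starts a line, so the stray mention in the prose line is skipped. Without it, the match begins at the first RIASWIX found anywhere.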

Your question is tagged with both Python and Java; this answer covers Java.

Regex implementation:

  • If you need to exclude the keywords at the beginning and at the end of every matched occurrence, use a positive lookbehind and a positive lookahead. They assert that RIASWIX precedes the match and that Sky Access follows it, without including either keyword in the match.

  • Then, you should use a reluctant quantifier to only match the text in between a pair of keywords, or else you would match the whole text between the first and last keyword.

  • Finally, your regex should enable the DOTALL flag in order to match the text across multiple lines.
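
The three points above can be sketched in Python, whose re module supports the same lookaround and lazy-quantifier syntax. The two-record sample text is hypothetical and much shorter than the real document.

```python
import re

# Hypothetical two-record sample with the same start/end keywords.
text = (
    "RIASWIX.A  first\n"
    "  SKY None:  No Resource with Sky Access\n"
    "RIASWIX.B  second\n"
    "  SKY None:  No Resource with Sky Access\n"
)

# Lookbehind/lookahead assert the keywords without including them in the match.
lazy = re.findall(r"(?<=RIASWIX).*?(?=Sky Access)", text, flags=re.DOTALL)

# Greedy '.*' runs from the first start keyword to the last end keyword.
greedy = re.findall(r"(?<=RIASWIX).*(?=Sky Access)", text, flags=re.DOTALL)

# Without DOTALL, '.' cannot cross a newline, so no match spans the two lines.
single_line = re.findall(r"(?<=RIASWIX).*?(?=Sky Access)", text)

print(len(lazy), len(greedy), len(single_line))  # prints: 2 1 0
```

The lazy pattern yields one match per record, the greedy one merges both records into a single match, and dropping DOTALL yields no match at all because each record spans more than one line.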

Implementation with keywords excluded

https://regex101.com/r/6Lnm5i/1

String text = "... your text to parse ....";

//Create a regex with DOTALL mode enabled. Alternatively, enable the flag inline by putting (?s) at the start of the pattern.
Pattern regex = Pattern.compile("(?<=RIASWIX).*?(?=Sky Access)", Pattern.DOTALL);

//Creating a matcher built on your regex and the text to parse
Matcher matcher = regex.matcher(text);

//While there are still occurrences
while(matcher.find()){
    //Printing the occurrence
    System.out.println(matcher.group());
}

Implementation with keywords included

https://regex101.com/r/6RYTYf/1

String text = "... your text to parse ....";

//Create a regex with DOTALL mode enabled. Alternatively, enable the flag inline by putting (?s) at the start of the pattern.
Pattern regex = Pattern.compile("RIASWIX.*?Sky Access", Pattern.DOTALL);

//Creating a matcher built on your regex and the text to parse
Matcher matcher = regex.matcher(text);

//While there are still occurrences
while(matcher.find()){
    //Printing the occurrence
    System.out.println(matcher.group());
}


 