
Matching in regex all capital words between two words

I am writing a text parser. The text contains two specific words that I do not want to match, and I want to capture every all-capital word that appears between them.

For example, the text might be:

Treatments: IBUPROFEN\n\xe2\x80\xa2 COLCHICINE .... Physical examination

I tried (?<=Treatments)(?:.*?)(\b[A-Z]+\b)(?:.*?)(?=Physical) but it didn't work.

I would like to capture only the words in capital letters between "Treatments" and "Physical examination".

To capture only the words that are in capital letters and lie between the words begin and end, use this regex:

.*begin|end.*|[^e]*?\b([A-Z]{2,})\b


When you replace end with some other word, be sure to also replace the e in the [^e]*? part with the first letter of the new word. For example, if you replace end with Stop, then [^e]*? must become [^S]*?.
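The substitution rule above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names build_pattern and capital_words_between and the sample sentence are made up here, and re.escape is added so that delimiter words containing regex metacharacters are treated literally.

```python
import re

def build_pattern(begin, end):
    """Assemble the three-branch regex for a given pair of delimiter words."""
    first = end[0]  # first letter of the end word; the lazy skip must not cross it
    return (
        ".*" + re.escape(begin)                          # branch 1: everything up to the begin word
        + "|" + re.escape(end) + ".*"                    # branch 2: the end word and everything after it
        + "|[^" + re.escape(first) + r"]*?\b([A-Z]{2,})\b"  # branch 3: next all-capital word
    )

def capital_words_between(text, begin, end):
    """Return the all-capital words (2+ letters) found between begin and end."""
    matches = re.findall(build_pattern(begin, end), text, re.DOTALL)
    # Branches 1 and 2 leave the capture group empty, so drop those entries.
    return [m for m in matches if m]

sample = "skip THIS begin one TWO three FOUR Stop not THIS"  # hypothetical input
print(capital_words_between(sample, "begin", "Stop"))  # ['TWO', 'FOUR']
```

Deriving the excluded letter from the end word keeps the two places that must change together in sync.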

For the example in question, this regex becomes:

.*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b


Note that you need to tell your regex engine to make . (dot) match the newline character:

  • In Python, pass the re.DOTALL flag.
  • In JavaScript, replace every . (dot) in the regex with [\s\S] (engines that support the ES2018 s flag can use that instead).
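To see why the dot has to match newlines, here is a small Python comparison. It is a sketch on a made-up multi-line sample (not the text from the question): it runs the pattern with re.DOTALL, with the [\s\S] substitution described for JavaScript, and with neither.

```python
import re

# Hypothetical sample: capital words appear before, between and after the delimiters.
text = "NOTE: see chart\nTreatments: IBUPROFEN\nCOLCHICINE ....\nPhysical examination:\nBP normal"

dot_pattern = r".*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b"
portable_pattern = r"[\s\S]*Treatments|Physical examination[\s\S]*|[^P]*?\b([A-Z]{2,})\b"

def captured(pattern, flags=0):
    # Keep only the matches where the capture group is non-empty.
    return [m for m in re.findall(pattern, text, flags) if m]

with_flag = captured(dot_pattern, re.DOTALL)   # dot matches newlines
portable = captured(portable_pattern)          # no flag needed
no_flag = captured(dot_pattern)                # dot stops at each newline

print(with_flag)  # ['IBUPROFEN', 'COLCHICINE']
print(portable)   # ['IBUPROFEN', 'COLCHICINE']
print(no_flag)    # ['NOTE', 'IBUPROFEN', 'COLCHICINE', 'BP']
```

Without the flag, the first two branches can only consume a single line, so capital words on other lines outside the delimiters leak into the result.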

Also note that the first and the last matches (the ones that consume the text before and after the delimiters) leave the first capture group empty, so you need to ignore those matches (see the filter call in the Python example below).

Python example

import re

text = """Suspendisse potenti:
Not MATCHED here. Por TOG esfet.

Treatments:
Pellentesque eget sollicitudin quam, id venenatis odio. Nam non tortor elit. Pras ultricies est urna, eu feugiat purus tempor a. Donec IBUPROFEN feugiat tristique ante, eget vulputate velit rhoncus ut. Morbi MATCHED HERE elementum leo a vulputate cursus. Sed at purus sit amet sapien COLCHICINE ullamcorper convallis.

Physical examination:
Also NOT MATCHED here at TO pulvinar mi, at vehicula libero. Nunc semper, neque sed tempor iaculis, nunc diam egestas lacus, Peget sodales sapien orci eget leo."""
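# The first two branches consume the text outside the delimiters;
# only the third branch fills the capture group.
results = re.findall(r".*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b", text, re.DOTALL)
# Drop the empty strings produced by the first two branches.
words = list(filter(None, results))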

print(words)


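This seems to work in Java. Here is what the pattern uses:

  • (?msd: ...) multiline mode, dotall mode, and Unix newline mode
  • \b word boundary (written as \\b inside a Java string literal)
  • (?<= ...) positive look-behind
  • (?= ...) positive look-ahead

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapitalWords {
    public static void main(String[] args) {
        String str =
                "Treatments: IBUPROFEN\n\\xe2\\x80\\xa2 COLCHICINE .... Physical examination";

        // The unbounded .* inside the look-behind may be rejected by older Java releases.
        String pat = "(?msd:(?<=Treatments:.*)\\b([A-Z]+)\\b(?=.*Physical examination))";
        // Iterate until no more matches are found.
        Matcher m = Pattern.compile(pat).matcher(str);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}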

Prints

IBUPROFEN
COLCHICINE
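Python's built-in re module only accepts fixed-width look-behind, so the Java pattern above does not port to it directly. A minimal two-step sketch of the same idea follows. Note that this swaps the look-arounds for a plain search plus a second pass, and the function name capital_words_in_span is made up for illustration.

```python
import re

def capital_words_in_span(text, start="Treatments:", end="Physical examination"):
    """Find all-capital words between the first start marker and the last end marker."""
    # Step 1: isolate the text between the delimiters (greedy, newlines included).
    span = re.search(re.escape(start) + r"(.*)" + re.escape(end), text, re.DOTALL)
    if span is None:
        return []  # one of the delimiters is missing
    # Step 2: collect the capital words inside that span only.
    return re.findall(r"\b[A-Z]+\b", span.group(1))

text = "Treatments: IBUPROFEN\n\\xe2\\x80\\xa2 COLCHICINE .... Physical examination"
print(capital_words_in_span(text))  # ['IBUPROFEN', 'COLCHICINE']
```

Like the Java pattern, [A-Z]+ also accepts single capital letters; use [A-Z]{2,} if those should be skipped.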

