I am writing a parser for a text. The text contains two specific marker words that I do not want to match, and I want to capture every word written in capital letters that appears between them.
For example, the text could be:
Treatments: IBUPROFEN\n\xe2\x80\xa2 COLCHICINE .... Physical examination
I have tried this: (?<=Treatments)(?:.*?)(\\b[AZ]+\\b)(?:.*?)(?=Physical)
but it didn't work.
I would like to capture only the words in capital letters between "Treatments" and "Physical examination".
To capture only the words that are written in capital letters and lie between the words begin and end, use this regex:
.*begin|end.*|[^e]*?\b([A-Z]{2,})\b
When you replace end with some other word, be sure to also replace e in the [^e]*? part with the first letter of the new word. For example, if you replace end with Stop, then also replace [^e]*? with [^S]*?.
For the example in question, this regex becomes:
.*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b
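The substitution rule above can be wrapped in a small helper. This is only a minimal sketch: the function name build_pattern is hypothetical (it is not part of the original answer), and it assumes both markers are plain literal strings, which it escapes before deriving the negated character class from the first character of the end marker.

```python
import re

def build_pattern(begin, end):
    """Build the three-alternative regex for the given markers (hypothetical helper)."""
    first = re.escape(end[0])  # first character of the end marker, e.g. "P"
    return re.compile(
        ".*" + re.escape(begin)                  # swallow everything up to the begin marker
        + "|" + re.escape(end) + ".*"            # swallow everything from the end marker on
        + "|[^" + first + r"]*?\b([A-Z]{2,})\b",  # capture all-caps words in between
        re.DOTALL,
    )

pattern = build_pattern("Treatments", "Physical examination")
sample = "Old NOTE. Treatments: IBUPROFEN\n- COLCHICINE .... Physical examination: BP normal"
# Matches of the first two alternatives leave the capture group empty, so drop them.
found = [word for word in pattern.findall(sample) if word]
print(found)
```

With the sample string above, only the two words between the markers should remain after the empty captures are dropped.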
Note that you need to tell your regex engine to make . (dot) match the newline character: in Python, pass the re.DOTALL flag; in an engine that has no such flag, replace the . (dots) in the regex with [\s\S]. Also note that the first and the last regex matches won't have anything in the first capture group, so you need to ignore those matches (see the filter call in the Python example below).
import re
text = """Suspendisse potenti:
Not MATCHED here. Por TOG esfet.
Treatments:
Pellentesque eget sollicitudin quam, id venenatis odio. Nam non tortor elit. Pras ultricies est urna, eu feugiat purus tempor a. Donec IBUPROFEN feugiat tristique ante, eget vulputate velit rhoncus ut. Morbi MATCHED HERE elementum leo a vulputate cursus. Sed at purus sit amet sapien COLCHICINE ullamcorper convallis.
Physical examination:
Also NOT MATCHED here at TO pulvinar mi, at vehicula libero. Nunc semper, neque sed tempor iaculis, nunc diam egestas lacus, Peget sodales sapien orci eget leo."""
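# findall returns the content of capture group 1 for every match; the matches
# that consume the text before "Treatments" and after "Physical examination"
# leave that group empty.
results = re.findall(r".*Treatments|Physical examination.*|[^P]*?\b([A-Z]{2,})\b", text, re.DOTALL)
# Drop the empty strings produced by those two matches.
words = list(filter(None, results))
print(words)  # ['IBUPROFEN', 'MATCHED', 'HERE', 'COLCHICINE']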
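As a cross-check, the same words can be extracted in two steps: first isolate the text between the markers, then search that slice for capitalised words. This is a different technique from the single regex above, sketched under the assumption that each marker occurs once; the function name capital_words_between is hypothetical.

```python
import re

def capital_words_between(text, begin, end):
    """Return all-caps words (2+ letters) found between begin and end (hypothetical helper)."""
    # Step 1: isolate the section; re.DOTALL lets '.' span newlines.
    section = re.search(re.escape(begin) + "(.*?)" + re.escape(end), text, re.DOTALL)
    if section is None:
        return []  # at least one marker is missing, or they appear in the wrong order
    # Step 2: collect whole words made only of capital letters.
    return re.findall(r"\b[A-Z]{2,}\b", section.group(1))

sample = "IGNORED. Treatments: IBUPROFEN\n- COLCHICINE .... Physical examination: ALSO ignored"
words_two_step = capital_words_between(sample, "Treatments", "Physical examination")
print(words_two_step)
```

Splitting the work this way avoids having to tie the character class to the first letter of the end marker, at the cost of two passes over the text.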
This seems to work in Java. Here's what is used.
(?msd: ...) turns on multiline mode, dotall mode, and Unix-lines mode for the enclosed pattern.
\b is a word boundary (written as \\b inside a Java string literal).
(?<=...) is a positive lookbehind.
(?=...) is a positive lookahead.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str =
"Treatments: IBUPROFEN\n\\xe2\\x80\\xa2 COLCHICINE .... Physical examination";
String pat = "(?msd:(?<=Treatments:.*)\\b([A-Z]+)\\b(?=.*Physical examination))";
// Iterate until no more matches are found.
Matcher m = Pattern.compile(pat).matcher(str);
while (m.find()) {
    // Group 1 holds the all-caps word.
    System.out.println(m.group(1));
}
Prints
IBUPROFEN
COLCHICINE