Python regex for matching arbitrary number of elements between 2 substrings?

I'm trying to write a regex that finds all characters between a starting token ('MS' or 'PhD') and an ending token ('.' or '!'). What makes this tricky is that both starting tokens often appear together in my text data. I'm only interested in the characters bounded by the last starting token and the first ending token (and in all such occurrences).

start = 'MS|PhD'
end = '.|!'

input1 = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."
output1 = "in Statistics, Computer Science, or similar field"

input2 = "Applicant with MS in Biology or Chemistry desired."
output2 = "in Biology or Chemistry desired"

Here's my best attempt, which is currently returning an empty list:

#          start  any char    end
pattern = r'^(MS|PhD) .* (\.|!)$'
re.findall(pattern,"candidate with MS in Chemistry.")

>>>
[]

Could someone point me in the right direction?

You could use a capturing group for the text you want, and match MS or PhD and the closing . or , outside of the group.

\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]
  • \b(?:MS|PhD)\s* A word boundary, then match either MS or PhD followed by 0+ whitespace chars, kept outside the group so they are not captured
  • ( Capture group 1, which contains the desired value
    • (?: Non capture group
      • (?!\b(?:MS|PhD)\b). Match any char except a newline, but only if MS or PhD does not start at that position
    • )* Close the non capture group and repeat it 0+ times
  • )[.,] Close group 1 and match either . or ,
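
To see what the negative lookahead buys you, here is a small sketch that builds the same pattern from named pieces and compares it with a plain `.*`. The names `START`, `TEMPERED`, `pattern` and `naive` are only illustrative and not part of the original answer.

```python
import re

# Pieces of the pattern above; the variable names are illustrative only.
START = r"\b(?:MS|PhD)\b"
TEMPERED = rf"(?:(?!{START}).)*"  # any char that does not begin a new start token
pattern = re.compile(rf"\b(?:MS|PhD)\s*({TEMPERED})[.,]")

text = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."

m = pattern.search(text)
# The text between the start of the match and the start of group 1
# is the start token that was actually used.
start_token = text[m.start():m.start(1)].strip()
captured = m.group(1)
print(start_token)  # PhD
print(captured)     # in Statistics, Computer Science, or similar field

# For contrast: a plain .* happily runs across the second start token.
naive = re.search(r"\b(?:MS|PhD)\s*(.*)[.,]", text).group(1)
print(naive)        # or PhD in Statistics, Computer Science, or similar field
```

Starting at MS, the group can only consume "or " before the lookahead blocks it at PhD, and no . or , follows at that point, so that attempt fails and the engine moves on to PhD.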

import re

regex = r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]"
s = ("Candidate with MS or PhD in Statistics, Computer Science, or similar field.\n"
    "Applicant with MS in Biology or Chemistry desired.")

matches = re.findall(regex, s)
print(matches)

Output

['in Statistics, Computer Science, or similar field', 'in Biology or Chemistry desired']
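
As a side note, the sketch below shows why the attempt in the question returns an empty list, followed by a hypothetical variant that is not part of the answer above. The variant assumes the end tokens are '.' or '!' as stated in the question, and uses a lazy quantifier so the capture stops at the first ending token. The third sample sentence is made up for illustration.

```python
import re

text = "candidate with MS in Chemistry."

# The attempt from the question: ^ and $ anchor the match to the whole string,
# so the string would have to begin with "MS" or "PhD". The literal space
# before (\.|!) also requires a space right before the final punctuation.
original = r'^(MS|PhD) .* (\.|!)$'
print(re.findall(original, text))  # []

# Hypothetical variant (an assumption, not the answer's pattern):
# [.!] as the end tokens and *? so the group stops at the FIRST end token.
variant = r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*?)[.!]"

samples = [
    "Candidate with MS or PhD in Statistics, Computer Science, or similar field.",
    "Applicant with MS in Biology or Chemistry desired.",
    "PhD in Physics preferred! Apply today.",  # made-up sample
]
results = [re.findall(variant, s) for s in samples]
for r in results:
    print(r)
```

Because the lazy version stops at the first . or !, it would also stop early at a period inside an abbreviation, so it may need adjusting for such text.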
