Python regex to match any number of elements between two substrings?

Question

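I'm trying to write a regex that finds all characters between a start token ('MS' or 'PhD') and an end token ('.' or '!'). What makes this tricky is that both start tokens commonly appear together in my text data. I'm only interested in the characters bounded by the last start token and the first end token (and in all such occurrences).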

start = 'MS|PhD'
end = '.|!'

input1 = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."
output1 = "in Statistics, Computer Science, or similar field"

input2 = "Applicant with MS in Biology or Chemistry desired."
output2 = "in Biology or Chemistry desired"

Here is my best attempt, which currently returns an empty list:

#          start  any char    end
pattern = r'^(MS|PhD) .* (\.|!)$'
re.findall(pattern,"candidate with MS in Chemistry.")

>>>
[]

Could someone point me in the right direction?

Answer 1

You could use a capture group, and match either MS or PhD and the closing . or , outside of the group.

\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]

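\b(?:MS|PhD)\s* A word boundary, then either MS or PhD followed by 0+ whitespace chars, matched here so they are not captured in the group
( Capture group 1, which contains the desired value
- (?: Non-capture group
  - (?!\b(?:MS|PhD)\b). Match any char except a newline, provided MS or PhD does not start at the current position
- )* Close the non-capture group and repeat it 0+ times
)[.,] Close group 1 and match either . or ,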
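
As a rough illustration of what the negative lookahead contributes, the sketch below compares the pattern above with a simplified variant that uses a plain (.*) in the group. The names `sample`, `naive` and `tempered` are made up for this example, and the `naive` pattern is an assumption added only for contrast.

```python
import re

sample = "Candidate with MS or PhD in Statistics, Computer Science, or similar field."

# Hypothetical simplified variant: nothing stops the group from
# running past a later start token.
naive = r"\b(?:MS|PhD)\s*(.*)[.,]"

# Pattern from this answer: the lookahead refuses to consume a char
# at any position where MS or PhD begins.
tempered = r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]"

naive_result = re.findall(naive, sample)
tempered_result = re.findall(tempered, sample)

print(naive_result)     # group starts right after the first start token (MS)
print(tempered_result)  # group starts right after the last start token (PhD)
```

With the lookahead in place, the attempt that starts at MS fails, because the group cannot cross PhD and there is no . or , before it. The engine then moves on and matches from PhD instead.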


import re

regex = r"\b(?:MS|PhD)\s*((?:(?!\b(?:MS|PhD)\b).)*)[.,]"
s = ("Candidate with MS or PhD in Statistics, Computer Science, or similar field.\n"
    "Applicant with MS in Biology or Chemistry desired.")

matches = re.findall(regex, s)
print(matches)

Output

['in Statistics, Computer Science, or similar field', 'in Biology or Chemistry desired']
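
Note that the pattern above ends on . or , and, because the repetition is greedy, on the last such character in the line. If the end tokens should be . or ! as in the question, and the match should stop at the first one, a possible variation is sketched below. This is an assumption about the intended behaviour rather than part of the tested answer; `first_end_regex` and `text` are illustrative names.

```python
import re

# Variation (assumption): [^.!] replaces the dot, so the group cannot
# run past the first end token; the lookahead still blocks MS/PhD.
# Unlike the dot, [^.!] also matches newlines.
first_end_regex = r"\b(?:MS|PhD)\b\s*((?:(?!\b(?:MS|PhD)\b)[^.!])*)[.!]"

text = ("Candidate with MS or PhD in Statistics, Computer Science, or similar field. "
        "Applicant with MS in Biology or Chemistry desired!")

first_end_matches = re.findall(first_end_regex, text)
print(first_end_matches)
```

Both sentences sit on one line here, which the greedy dot version would not split correctly; the negated character class is what keeps each match inside its own sentence.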
