
Python regex A|B|C matches C even though B should match

I've been stuck on this problem for several hours now and I'm out of ideas. Essentially, I have an A|B|C-style alternation regex, and for whatever reason C matches over B, even though the individual alternatives should be tested from left to right, stopping as soon as one succeeds (i.e. once a match is found, the other alternatives are not tested anymore).

This is my code:

import re

text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
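# Alternatives: exact string | exact string after a non-word character
# | first word, anything (lazy), last word.
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"
                    + expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])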

m = re_exp.search(text)
print(m.group(0))

I want the regex to find the "expansion" string. In my dataset, the text sometimes contains a slightly edited version of the expansion string, for example with articles or prepositions like "for" or "the" between the main nouns. This is why I first try to match the string as is, then try to match it when it follows any non-word character (i.e. parentheses or, as in the example above, a whole lot of stuff because the space was omitted), and finally go full wildcard, searching for the beginning and end of the string with wildcards in between.

Either way, with the example above I would expect to get the following output:

American Heart Association

but what I'm getting is

American College of Cardiology (ACC)/American Heart Association

which is the match for the final regex.

If I delete the final regex or just call re.findall(r"(?<=\W)" + expansion, text), I get the output I want, meaning the regex is in fact matching properly.

What gives?

So re.findall(r"(?<=\W)" + expansion, text) works because the character before the match, "/", is a non-word character (denoted \W). The third alternative of your regex, however, will match "American [whatever random stuff here] Association". The search scans the text from left to right and tries every alternative at each position before moving on, so you match "American College of Cardiology (ACC)/American Heart Association" before you ever get a chance to match the inner string "American Heart Association". E.g. if you deleted the first "American" in your string, you would get the match you are looking for with your regex.
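A quick way to check this claim is to run the combined pattern against a shortened copy of the text, once as is and once with the first "American" removed. This is only a sketch: the shortened text and the variable names are assumptions made for illustration, not part of the original code.

```python
import re

# Shortened stand-in for the text in the question (an assumption for brevity).
text = "the American College of Cardiology (ACC)/American Heart Association (AHA)"

# The same three alternatives as in the question, written out literally.
pattern = re.compile(
    r"American Heart Association"
    r"|(?<=\W)American Heart Association"
    r"|American[-\s].*?\s*?Association"
)

# With both occurrences of "American" present, the wildcard alternative
# succeeds at the first one, so the long span is returned.
full_match = pattern.search(text).group(0)
print(full_match)

# Remove only the first "American " and search again.
trimmed = text.replace("American ", "", 1)
trimmed_match = pattern.search(trimmed).group(0)
print(trimmed_match)
```

The first call prints the long span starting at "American College", the second prints just "American Heart Association".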

You need to be more restrictive with your regex to rule out situations like these.

The regex looks like this:

American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association

The first two alternatives match the same text; the second one merely has a positive lookbehind prepended.

You can omit the second alternative: wherever it could match, the first alternative (which has no assertions) has already matched, and wherever the first one fails, the second one fails too.

The pattern is matched from left to right. At the first occurrence of American, the first and second alternatives cannot match, because what follows is College of Cardiology.

Then the third alternative can match at that position, and thanks to the .*? it stretches until the first occurrence of Association.
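The left-to-right behaviour described above can be made visible by searching with each alternative on its own and comparing where its match starts. This is a minimal sketch on a shortened copy of the text; the dictionary keys and variable names are hypothetical labels chosen for illustration.

```python
import re

# Shortened stand-in for the text in the question (an assumption for brevity).
text = "the American College of Cardiology (ACC)/American Heart Association (AHA)"

# The three alternatives, kept separate so each can be searched on its own.
alternatives = {
    "exact": r"American Heart Association",
    "lookbehind": r"(?<=\W)American Heart Association",
    "wildcard": r"American[-\s].*?\s*?Association",
}

# Start offset of the first match of each alternative.
starts = {name: re.search(p, text).start() for name, p in alternatives.items()}
print(starts)

# The combined pattern reports the match that starts earliest in the text,
# regardless of the order in which the alternatives are written.
combined = re.compile("|".join(alternatives.values()))
combined_start = combined.search(text).start()
print(combined_start == starts["wildcard"])
```

The wildcard alternative starts at the first "American", while the other two start at the second, which is why the combined pattern returns the wildcard match.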


What you might do is, for example, exclude certain characters from the match using a negated character class:

\bAmerican\b[^/,.]*\bAssociation\b
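As a sketch of how this pattern could be built from the expansion string in Python: the helper name build_pattern, its forbidden parameter, and the second sample text are assumptions for illustration, and the set of excluded characters would need tuning for a real dataset.

```python
import re

def build_pattern(expansion, forbidden="/,."):
    """Hypothetical helper: first word, any run of allowed characters, last word."""
    words = expansion.split()
    first, last = re.escape(words[0]), re.escape(words[-1])
    # Negated character class: the gap may not contain any forbidden character.
    gap = "[^" + re.escape(forbidden) + "]*"
    return re.compile(r"\b" + first + r"\b" + gap + r"\b" + last + r"\b")

pattern = build_pattern("American Heart Association")

text = ("the American College of Cardiology (ACC)/American Heart Association (AHA), "
        "and class III-IV of the New York Heart Association (NYHA)")
exact_result = pattern.search(text).group(0)
print(exact_result)

# Made-up variant with extra words between the main nouns.
variant = "(ACC)/American Heart and Stroke Association (AHA)"
variant_result = pattern.search(variant).group(0)
print(variant_result)
```

Because the gap cannot cross "/", the attempt starting at "American College" fails and the search moves on to the second "American". Note that the gap is greedy, so within one run of allowed characters it extends to the last "Association" rather than the first.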


Or you might use a tempered greedy token to disallow specific words between the first and last part:

\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
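To try the tempered greedy token in Python, the pattern can be split across adjacent raw strings so that each part is easier to read. This is a sketch only: the shortened text and the second sample string are assumptions made up for illustration.

```python
import re

tempered = re.compile(
    r"\bAmerican\b"                        # first word of the expansion
    r"(?:(?!American\b|Association\b).)*"  # any char that does not start either word
    r"\bHeart Association\b"               # remaining words of the expansion
)

# Shortened stand-in for the text in the question.
text = "the American College of Cardiology (ACC)/American Heart Association (AHA)"
result = tempered.search(text).group(0)
print(result)

# Made-up variant with an extra word after "American".
variant = "(ACC)/American National Heart Association (AHA)"
variant_result = tempered.search(variant).group(0)
print(variant_result)
```

The negative lookahead is checked before every character in the gap, so an attempt that starts at "American College" stops as soon as it reaches the second "American" and fails, leaving the later occurrence to match.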

