
Python regex A|B|C matches C even though B should match

I've been stuck on this problem for several hours now and I'm out of ideas. Essentially, I have a regex of the form A|B|C, and for whatever reason C matches instead of B, even though the alternatives should be tried from left to right, stopping at the first one that matches (i.e. once a match is found, the remaining alternatives are not tested anymore).

This is my code:

import re

text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
# Three alternatives: exact string | exact string after a non-word character | first word ... last word
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"
                    + expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])

m = re_exp.search(text)
print(m.group(0))

I want the regex to find the "expansion" string. In my dataset, the text sometimes contains a slightly edited version of the expansion string, for example with articles or prepositions like "for" or "the" between the main nouns. This is why I first try to match the string as is, then try to match it when it follows any non-word character (i.e. a parenthesis or, as in the example above, a whole lot of other text because the space was omitted), and finally go full wildcard: I search for the first and last word of the string with wildcards in between.

Either way, with the example above I would expect to get the following output:

American Heart Association

but what I'm getting is

American College of Cardiology (ACC)/American Heart Association

which is the match for the final regex.

If I delete the final alternative or just call re.findall(r"(?<=\W)" + expansion, text), I get the output I want, meaning that part of the regex is in fact matching properly.

What gives?

re.findall(r"(?<=\W)" + expansion, text) works because the character right before the match, "/", is a non-word character (denoted \W). Your full regex, however, also matches "American [whatever random stuff here] Association". This means it matches "American College of Cardiology (ACC)/American Heart Association", which starts earlier in the text, before it ever gets to the inner string "American Heart Association". E.g. if you deleted the first "American" in your string, you would get the match you are looking for with your regex.

You need to be more restrictive with your regex to rule out situations like these.
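The point above can be checked with a small experiment. This is only a sketch: the shortened `text` and the names `with_first` and `without_first` are made up for illustration, while the pattern is the one from the question written out in full.

```python
import re

# Shortened version of the sentence from the question (illustrative).
text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA), and class III-IV")

pattern = re.compile(
    r"American Heart Association"
    r"|(?<=\W)American Heart Association"
    r"|American[-\s].*?\s*?Association"
)

# With both occurrences of "American" present, the wildcard alternative
# wins because it can start at the earlier "American".
with_first = pattern.search(text).group(0)

# Remove only the first "American": the earliest possible match now
# starts at the remaining one, and the exact alternative matches there.
without_first = pattern.search(text.replace("American", "", 1)).group(0)

print(with_first)
print(without_first)
```

The first line printed is the long, unwanted match. The second is the exact expansion.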

The regex looks like this:

American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association

The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.

You can omit the second alternative: anything it can match, the first alternative (which has no assertion) matches as well, and wherever the first alternative fails, the second fails too.

The regex engine scans the text from left to right and tries every alternative at a given starting position before it moves on to the next position. At the first occurrence of American, the first and the second alternatives cannot match, because the text continues with College of Cardiology.

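The third alternative, however, can match from that same position: due to the .*? it stretches until the first occurrence of Association. That match is returned, so the engine never gets to try the first alternative at the second American.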
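The rule that the start position takes priority over the order of the alternatives can be seen on a tiny input. The sketch below also shows one possible way to get real priority between patterns, by searching them one at a time. `search_in_priority_order` is a hypothetical helper written for this example, not something from the question or the `re` module, and the shortened `text` is illustrative.

```python
import re

# Alternation order only decides between matches at the SAME start position.
# "b" is listed first, but "ab" wins because it can start at index 0.
m = re.search(r"b|ab", "ab")
print(m.group(0), m.start())


def search_in_priority_order(patterns, text):
    """Return the match of the first pattern that matches anywhere in text.

    Hypothetical helper: each pattern is searched over the whole text
    before the next, lower-priority pattern is tried.
    """
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match
    return None


text = "American College of Cardiology (ACC)/American Heart Association (AHA)"
patterns = [
    r"American Heart Association",       # exact expansion first
    r"American[-\s].*?\s*?Association",  # wildcard only as a fallback
]
result = search_in_priority_order(patterns, text)
print(result.group(0))
```

With separate searches the wildcard pattern is only consulted when the exact string does not occur anywhere, which is closer to the behaviour the question expected from A|B|C.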


What you might do, for example, is restrict which characters are allowed between the first and the last word by using a negated character class:

\bAmerican\b[^/,.]*\bAssociation\b
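As a quick check, here is the pattern above run against a shortened version of the sentence from the question. The shortened `text` and the variable names are illustrative.

```python
import re

text = ("stage D of the ABCD classification of the American College of "
        "Cardiology (ACC)/American Heart Association (AHA), and class III-IV")

# [^/,.]* cannot cross "/", "," or ".", so an attempt starting at the first
# "American" never reaches "Association" and the engine moves on.
pattern = re.compile(r"\bAmerican\b[^/,.]*\bAssociation\b")

match = pattern.search(text)
print(match.group(0))
```

Note that `[^/,.]*` is greedy, so if several occurrences of Association appear before the next excluded character, the match extends to the last of them. Using `[^/,.]*?` would stop at the first one instead.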


Or you might use a tempered greedy token approach to not allow specific words between the first and last part:

\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
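A small sketch of how this pattern behaves, using a shortened version of the sentence from the question. The `variant` string is a made-up name, included only to show that extra words between the parts are still tolerated.

```python
import re

text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA), and class III-IV")

# (?:(?!American\b|Association\b).)* consumes one character at a time, but
# only while neither "American" nor "Association" starts at that position.
pattern = re.compile(
    r"\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b"
)

first = pattern.search(text)
print(first.group(0))

# Made-up variant with an extra word between the first and last part.
variant = "members of the American National Heart Association"
second = pattern.search(variant)
print(second.group(0))
```

An attempt starting at the first "American" is stopped by the lookahead as soon as it reaches the second "American", so only the inner string can match.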

