
Python regex prefers longer fuzzy match to shorter exact match

I am using regex in Python to search for multiple patterns in a string. A simplified example would be as follows:

import regex
s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"  #string to look into
p = ['(?P<string>string)', '(?P<longtext>longtext)']  #patterns to search for
r = regex.compile('(?b)(' + " | ".join(p) + '){s<=3}')  #regex, allowing for 3 mismatches, bestmatch to be reported
r.search(s)   #searching for patterns p in string s
<regex.Match object; span=(18, 25), match='stringv', fuzzy_counts=(1, 0, 0)>   #search results

My expected result would be:

<regex.Match object; span=(18, 24), match='string', fuzzy_counts=(0, 0, 0)>

Why does regex report the fuzzy match stringv with 1 mismatch instead of the exact match string? And how do I need to modify my code to get my expected result?

I am using Python 3.7.3 and regex 2.5.115.

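The '(?b)(' + " | ".join(p) + '){s<=3}' expression produces the regex (?b)((?P<string>string) | (?P<longtext>longtext)){s<=3}; note the spaces around |. Those spaces are literal, so the first alternative is really "string " (7 characters, with a trailing space) and the second is " longtext" (with a leading space). Because only substitutions are allowed ({s<=3}), a match must be as long as the alternative it matches, so the 6-character exact match string is not a candidate at all. Instead, v is substituted for the space when matching the (?P<string>string) part, and you get stringv as a match with one substitution.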
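The effect of the separator can be checked without the regex module at all. The following is a minimal sketch using only plain string operations; the helper name hamming is hypothetical and is there only to count substitutions between two equal-length strings.

```python
s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"  # input from the question
p = ['(?P<string>string)', '(?P<longtext>longtext)']  # patterns from the question

spaced = " | ".join(p)  # what the question builds: literal spaces around |
tight = "|".join(p)     # what was intended: a bare alternation

print(repr(spaced))
print(repr(tight))


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("substitution-only comparison needs equal lengths")
    return sum(x != y for x, y in zip(a, b))


# With the spaced separator, the first alternative matches the 7 characters
# "string " (trailing space), so the candidate text is s[18:25], not s[18:24].
candidate = s[18:25]
substitutions = hamming("string ", candidate)
print(candidate, substitutions)
```

The candidate differs from "string " only in its last character, which is the single substitution reported in fuzzy_counts.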

You need to join the patterns with a bare "|", without the surrounding spaces:

r = regex.compile('(?b)(' + "|".join(p) + '){s<=3}')  #regex, allowing for 3 mismatches, bestmatch to be reported

See the Python demo:

import regex
s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"  # string to look into
p = ['(?P<string>string)', '(?P<longtext>longtext)']  # patterns to search for
rx = '(?b)(' + "|".join(p) + '){s<=3}'  # no spaces around |
r = regex.compile(rx)  # regex, allowing for 3 mismatches, bestmatch to be reported
print(r.search(s))
# => <regex.Match object; span=(18, 24), match='string'>
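To keep stray whitespace out of the alternation in the first place, the pattern can be built by a small helper. This is only a sketch: build_fuzzy_alternation is a hypothetical name, and stripping each pattern assumes that none of them is meant to start or end with a literal space. The search part runs only if the third-party regex package is installed.

```python
def build_fuzzy_alternation(patterns, max_subs=3):
    """Join patterns with a bare '|' inside a best-match fuzzy group.

    Assumption: leading/trailing whitespace in a pattern is accidental.
    """
    cleaned = [pat.strip() for pat in patterns]
    return '(?b)(' + '|'.join(cleaned) + '){s<=' + str(max_subs) + '}'


s = "vrhvydhvkzejjvksdlstringvhehvehvurejlcslvdk"
p = ['(?P<string>string)', '(?P<longtext>longtext)']

rx = build_fuzzy_alternation(p)
print(rx)

try:
    import regex  # third-party package (pip install regex)
except ImportError:
    regex = None

if regex is not None:
    m = regex.compile(rx).search(s)
    if m is not None:
        # span and matched text, which named groups took part, and the
        # (substitutions, insertions, deletions) counts of the match
        print(m.span(), m.group(0))
        print(m.groupdict())
        print(m.fuzzy_counts)
```

Checking groupdict() shows which of the named alternatives produced the match, which is useful once the list of patterns grows.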

