
How can I allow a fuzzy regex match for only part of the pattern?

I have a pattern_string = 'ATAG/GAGAAGATGATG/TATA' and a query_string = 'ATAG/AGCAAGATGATG/TATA'. The following fuzzy regex match works for this pair:

import regex

r = regex.compile('(%s){e<=2}' % pattern_string)
r.match(query_string)

Here, the only change is between the two / characters. However, I want to restrict the fuzziness of the match to only be allowed between these characters, while the characters outside of the / bounds remain an exact match.

For example, pattern_string = 'ATGG/GAGAAGATGATG/TATA' and query_string = 'ATAG/AGCAAGATGATG/TATA' should not match, because the first part of the string (ATGG vs ATAG) differs. Similarly, pattern_string = 'ATAG/GAGAAGATGATG/TATG' and query_string = 'ATAG/AGCAAGATGATG/TATA' should not match either, because the last part of the string (TATG vs TATA) differs.

In summary, the portion of the string between the / characters (or any other delimiter) should be allowed a fuzzy match according to the constraint given to the regex ({e<=2} in this case), but everything outside the delimiters must match exactly.

How can this be achieved?

I am imagining a function like the following:

ideal_function(pattern_string, query_string)

where

ideal_function(pattern_string = 'ATAG/GAGAAGATGATG/TATA', query_string = 'ATAG/AGCAAGATGATG/TATA') returns True

ideal_function(pattern_string = 'ATGG/GAGAAGATGATG/TATA', query_string = 'ATAG/AGCAAGATGATG/TATA') returns False

The most efficient method would be appreciated: I have to run this for over 20,000 pattern strings against over 5 million query strings, so it needs to be as fast as possible. It does not have to be a regex solution, but it must support fuzzy matching specified either as a substitution count (as in {s<=2}) or as a total error count (as in {e<=2}).

You can limit fuzziness to the section of the pattern between slashes using the following implementation of your desired ideal_function():

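import regex

def ideal_function(pattern_string, query_string, fuzzy='e<=2'):
    # Split into the exact prefix, the fuzzy body, and the exact suffix.
    prefix, body, suffix = pattern_string.split('/')
    # Only the parenthesised body carries the fuzzy constraint.
    r = regex.compile('%s/(%s){%s}/%s' % (prefix, body, fuzzy, suffix))
    return r.match(query_string) is not None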

Here it is in action:

>>> ideal_function('ATAG/GAGAAGATGATG/TATA', 'ATAG/AGCAAGATGATG/TATA')
True

>>> ideal_function('ATGG/GAGAAGATGATG/TATA', 'ATAG/AGCAAGATGATG/TATA')
False

>>> ideal_function('ATAG/GAGAAGATGATG/TATA', 'ATAG/AGCAAGATGATG/TATA', 'e<=1')
False

>>> ideal_function('ATAG/GAGAAGATGATG/TATA', 'ATAG/AGCAAGATGATG/TATA', 'e<=2')
True

>>> ideal_function('ATAG/GAGAAGATGATG/TATA', 'ATAG/AGCAAGATGATG/TATA', 's<=2')
False

>>> ideal_function('ATAG/GAGAAGATGATG/TATA', 'ATAG/AGCAAGATGATG/TATA', 's<=3')
True

This relies on your always having exactly three slash-delimited sections in the pattern, but since anything more generalised would also require specifying which sections are fuzzy and which are non-fuzzy somehow, I assume this straightforward approach fits your use case.
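The three-section assumption can be made explicit rather than left to fail with an unpacking error. The following is only a sketch: build_fuzzy_pattern and its delimiter parameter are hypothetical names, not part of the regex module, and the escaping via re.escape is an added precaution that changes nothing for plain DNA strings. The function only builds the pattern text, which is then meant to be passed to regex.compile as before.

```python
import re

def build_fuzzy_pattern(pattern_string, fuzzy='e<=2', delimiter='/'):
    """Build the pattern text: exact prefix, fuzzy body, exact suffix.

    Hypothetical helper; the returned string is intended for regex.compile().
    """
    sections = pattern_string.split(delimiter)
    if len(sections) != 3:
        raise ValueError(
            'expected exactly three %r-delimited sections, got %d'
            % (delimiter, len(sections)))
    # Escape each literal part so that characters such as '.' or '|'
    # are matched literally instead of being read as regex syntax.
    prefix, body, suffix = (re.escape(section) for section in sections)
    sep = re.escape(delimiter)
    return '%s%s(%s){%s}%s%s' % (prefix, sep, body, fuzzy, sep, suffix)

print(build_fuzzy_pattern('ATAG/GAGAAGATGATG/TATA'))
```

A pattern string with too few or too many delimiters then raises a ValueError that names the problem directly.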

Any version of ideal_function() which has to create the appropriate regular expression every time it's called probably isn't going to be the most efficient approach, by the way (although you'd have to do some profiling to establish how much difference it actually makes in your particular case).

Depending on the kind of output you need, something like this might make more sense:

def ideal_generator(pattern_string, all_query_strings, fuzzy='e<=2'):
    prefix, body, suffix = pattern_string.split('/')
    # Compile once per pattern, then reuse it for every query string.
    r = regex.compile('%s/(%s){%s}/%s' % (prefix, body, fuzzy, suffix))
    for query_string in all_query_strings:
        if r.match(query_string) is not None:
            yield query_string

… which would yield all query strings matching pattern_string.
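Since the question allows non-regex solutions, the substitution-only case (as in {s<=2}) can also be checked without a regex at all: substitutions never change the length of a string, so the test reduces to counting mismatched positions in the middle section. The sketch below is an alternative, not the approach used above. substitution_match and max_subs are hypothetical names, and it assumes the whole query string is compared, so unlike r.match() it rejects queries with trailing characters. It does not cover insertions or deletions, so {e<=2} still needs the regex version.

```python
def substitution_match(pattern_string, query_string, max_subs=2, delimiter='/'):
    """Exact prefix and suffix, at most max_subs substitutions in the body."""
    prefix, body, suffix = pattern_string.split(delimiter)
    head = prefix + delimiter
    tail = delimiter + suffix
    # Substitutions preserve length, so the total lengths must agree.
    if len(query_string) != len(pattern_string):
        return False
    # The parts outside the delimiters must match exactly.
    if not (query_string.startswith(head) and query_string.endswith(tail)):
        return False
    middle = query_string[len(head):len(query_string) - len(tail)]
    mismatches = 0
    for expected, actual in zip(body, middle):
        if expected != actual:
            mismatches += 1
            if mismatches > max_subs:
                return False  # stop early once the budget is exceeded
    return True

print(substitution_match('ATAG/GAGAAGATGATG/TATA', 'ATAG/AGCAAGATGATG/TATA', max_subs=3))
```

Whether this is actually faster than a precompiled pattern for a given workload would need profiling.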
