简体   繁体   中英

Check if string contains pattern (with respect to one-to-one symbols mapping)

I'm trying to solve following problem:

Check, if the string contains substring, that can be obtained from pattern using one to one lower case symbols replacement (using bijection between original lower case letters and new letters).

My strings only consist of upper case and lower case english alphabet letters ( A–Z, a–z ) .

For example:

Positive :
string: justanexampleXstring
pattern: onecompleX
answer: true (anexampleX <-> onecompleX, used mapping: a<->o, x<->c; uppercase is not affected by mapping)
Negative:
string: AaBbDd
pattern: AaBa
answer: false (two repeated lower case letters in pattern and no repeated lower case letters in original string)

Currently I'm thinking about utilizing Rabin–Karp algorithm in multiple pattern search setup:

A naïve way to search for k patterns is to repeat a single-pattern search taking O(n+m) time, totaling in O((n+m)k) time. In contrast, the above algorithm can find all k patterns in O(n+km) expected time, assuming that a hash table check works in O(1) expected time.

But naïve check of strings generated by all possible mappings leads to k= 26! ( factorial , number of all possible one to one mappings between 26 letters alphabet)

Any ideas or optimizations to current approach would be appreciated. Thank you in advance!

You can do this using a slight change to Rabin-Karp, in the same expected run-time (supposing your hash function meets the usual requirements of uniformity). The idea is to think about what kinds of strings are equivalent to the pattern under a lowercase-letter bijection map. First, all uppercase letters must match exactly, so we can ignore them for the moment. For lowercase letters, what matters is the partition of indices based on equal characters.

For instance, if pattern = 'banana', whose length is m=6 , then the grouping of indices will be 3 groups, one for 'b' indices, one for 'a's, and one for 'n's.
The (unordered) partition is then ([0], [1, 3, 5], [2, 4]) . Now, let's find an equivalent string representation. Instead of this partition, replace each lowercase character of the pattern with the distance to the next equal character (ie, the next larger value in its partition group), or m , if this is the character's final appearance. This means our replaced string for 'banana' becomes
(6, 2, 2, 2, 6, 6) ; another length-m lowercase string has the same 'replaced' string if and only if it matches the pattern under grouping of equal-value indices, which matches the pattern if and only if they match under letter bijection.

We can use Rabin-Karp's rolling hash of length m , replacing lowercase letters as above, and mapping uppercase letters to another range of numbers above 'm' (say, m+1 through m+26). The only issue is that as our window slides to the right, a character in the middle of the window may change value:
given our replaced string of 'banana' (6, 2, 2, 2, 6, 6) , if the next character we add to the end of the string is 'a', our replaced string for 'ananaa' is (2, 2, 2, 6, 1, 6) . We can fix this by implementing a hashmap of the 'last seen' index for each lowercase letter as we process the string. Using the sum of coefficients of prime powers rolling hash, we can quickly calculate the change needed.

If you want to know the exact math for updating the rolling hash: after performing the usual window-shift update from Rabin-Karp, if the added character is lowercase and appeared j characters earlier in our window, its character value goes from 'm' to 'j'. For prime p , its contribution to the 'rolling hash modular sum' changes from m*(p^j) to j*(p^j) , and we can subtract the difference. Any rolling hash where you can easily compute any window character's current contribution to the hash will also work.

A more interesting question is whether Knuth-Morris-Pratt or Boyer-Moore can be adapted to work in a similar way; I have not found a similar transformation for those, since my algorithm depends on the rolling hash. Nevertheless, the average and worst-case runtime will be unchanged from Rabin-Karp.

I don't know about Rabin-Karp, but this is better than the k=26! you mentioned. I'm turning the pattern like onecompleX into a regular expression like (.)(.)(.)(.)\\1(.)(.)(.)\\3X :

import re

def solve(string, pattern):
    def replace(match, group={}):
        letter = match[0]
        if letter not in group:
            group[letter] = len(group) + 1
            return '(.)'
        return f'\\{group[letter]}'
    regex = re.sub('[a-z]', replace, pattern)
    return bool(re.search(regex, string))

print(solve('justanexampleXstring', 'onecompleX'))
print(solve('AaBbDd', 'AaBa'))

Output ( Try it online! ):

True
False

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM