
Find a pattern in a string without using regex

I'm trying to find a pattern in a string. Example:

trail = 'AABACCCACCACCACCACCACC'. One can note the "ACC" repetition after a prefix of AAB, so the result should be AAB(ACC).

How can I do this without using regex ('import re')? What I have done so far:

    def get_pattern(trail):
        for j in range(0,len(trail)):
            k = j+1
            while k<len(trail) and trail[j]!=trail[k]:
                k+=1
            if k==len(trail)-1:
                continue

            window = ''
            stop = trail[j]
            m = j
            while  m<len(trail) and k<len(trail) and trail[m]==trail[k]:
                window+=trail[m]
                m+=1
                k+=1
                if trail[m]==stop and len(window)>1: 
                    break

            if len(window)>1:
                prefix=''
                if j>0:
                    prefix = trail[0:j]
                return prefix+'('+window+')'
            
        return False

This (almost) does the trick, but in a use case like "AAAAAAAAAAAAAAAAAABDBDBDBDBDBDBDBDBDBDBDBDBDBDBDBD" the result is AA when it should be AAAAAAAAAAAAAAAAAA(BD).

The issue with your code is that once you find a repetition of length 2 or greater, you don't check forward to make sure it is maintained. In your second example, this causes it to grab onto the 'AA' without seeing the 'BD's that follow.

Since we know we're dealing with cases of prefix + window, it makes sense to look from the end rather than from the beginning.
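To make the "look from the end" idea concrete, here is a minimal sketch that counts how many back-to-back copies of a trailing window end a string. The name count_trailing_repeats and its width parameter are hypothetical helpers for illustration; they are not part of the answer's code.

```python
def count_trailing_repeats(s, width):
    """Count how many consecutive copies of s[-width:] end the string."""
    if width <= 0:
        raise ValueError("width must be positive")
    window = s[-width:]
    count = 0
    end = len(s)
    # Walk backwards one window at a time while the piece still matches.
    while end - width >= 0 and s[end - width:end] == window:
        count += 1
        end -= width
    return count


print(count_trailing_repeats("AABACCACCACC", 3))        # 3
print(count_trailing_repeats("A" * 18 + "BD" * 17, 2))  # 17
```

Scanning backwards means the leading run of 'A's in the second string never gets a chance to be mistaken for the repeating window.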

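    def get_pattern(string):
        str_len = len(string)

        # For every candidate window length, cut the string into pieces of
        # that length, working backwards from the end.
        splits = [[string[i-rep_length: i] for i in range(str_len, 0, -rep_length)]
                  for rep_length in range(1, str_len//2 + 1)]
        if not splits:
            return None  # fewer than two characters: nothing can repeat

        # Count how many consecutive pieces equal the last one. The appended
        # False keeps index() from raising when every piece matches.
        reps = [([window == split[0] for window in split] + [False]).index(False)
                for split in splits]

        # rep_length is i+1, so this is what remains in front of the repeats.
        prefix_lengths = [str_len - (i+1)*rep for i, rep in enumerate(reps)]
        shortest_prefix_length = min(prefix_lengths)

        # Keep only the candidates that leave the shortest prefix ...
        indices = [i for i, pre_len in enumerate(prefix_lengths) if pre_len == shortest_prefix_length]
        reps = list(map(reps.__getitem__, indices))
        splits = list(map(splits.__getitem__, indices))

        # ... and among those take the window with the most repetitions.
        max_reps = max(reps)
        window = splits[reps.index(max_reps)][0]
        prefix = string[0:shortest_prefix_length]

        return f'{prefix}({window})' if max_reps > 1 else None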

splits uses a list comprehension to create a list of lists, where each sublist splits the string into rep_length-sized pieces starting from the end.
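As a small illustration of what those sublists look like, the sketch below applies the same slicing to a short string. The helper name split_from_end and the sample string are assumptions made for this example only.

```python
def split_from_end(s, width):
    """Cut s into width-sized pieces, last piece first (same slicing as splits)."""
    return [s[i - width:i] for i in range(len(s), 0, -width)]


s = "AABACCACC"
for width in range(1, len(s) // 2):
    print(width, split_from_end(s, width))
# 1 ['C', 'C', 'A', 'C', 'C', 'A', 'B', 'A', 'A']
# 2 ['CC', 'CA', 'AC', 'AB', '']
# 3 ['ACC', 'ACC', 'AAB']
```

For width 2 the leftover piece comes out as an empty string in this example, because the negative start index of the slice wraps around; it can never equal the first piece, so it does not affect the repetition count.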

For each sublist split, the first piece split[0] is our proposed pattern, and we count how many times it is repeated. This is easily done by finding the first instance of False when checking window == split[0], using the list.index() method. We also want to calculate the size of the prefix. We want the shortest prefix with the largest number of reps. This is because of nasty edge cases like jeifjeiAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBB, where the window contains a B that repeats more times than the window itself. Additionally, anything that repeats 4 times can also be seen as a double-sized window repeated twice.
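The selection rule (shortest prefix first, then the most repetitions) can be sketched with an explicit loop and a tuple sort key. This is an alternative formulation for illustration, not the answer's code: best_window is a hypothetical name, and the sketch assumes that only windows repeating at least twice count as candidates.

```python
def best_window(s):
    """Return (prefix_length, reps, window) for the preferred trailing window, or None."""
    n = len(s)
    candidates = []
    for width in range(1, n // 2 + 1):
        window = s[n - width:]
        reps, end = 0, n
        # Count consecutive copies of the window, scanning from the end.
        while end >= width and s[end - width:end] == window:
            reps += 1
            end -= width
        if reps >= 2:  # assumption of this sketch: ignore non-repeating windows
            candidates.append((end, reps, window))
    # Shortest prefix wins; ties go to the candidate with more repetitions.
    return min(candidates, key=lambda c: (c[0], -c[1]), default=None)


# 'AB' x4 and 'ABAB' x2 leave the same prefix; more repetitions wins.
print(best_window("XYABABABAB"))  # (2, 4, 'AB')

# A string shaped like the edge case: 'B' repeats 14 times at the very end,
# but the 16-character window leaves a much shorter prefix.
edge = "jeifjei" + ("AA" + "B" * 14) * 4
print(best_window(edge))  # (7, 4, 'AABBBBBBBBBBBBBB')
```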

If you want to deal with an additional suffix, a hacky solution is to trim characters from the end until get_pattern() returns a pattern, and then append what was trimmed:

    def get_pattern_w_suffix(string):
        for i in range(len(string), 0, -1):
            pattern = get_pattern(string[0:i])
            suffix = string[i:]

            if pattern is not None:
                return pattern + suffix

        return None

However, this assumes that the suffix doesn't contain a pattern itself.
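The trimming approach and its limitation can be tried out with the self-contained sketch below. Note that find_trailing_pattern is a simplified, hypothetical stand-in for get_pattern() (it only accepts windows that repeat at least twice), so it may behave differently from the answer's function on some inputs; with_suffix mirrors the trimming loop above.

```python
def find_trailing_pattern(s):
    """Simplified stand-in for get_pattern(): returns 'prefix(window)' or None."""
    n = len(s)
    best = None
    for width in range(1, n // 2 + 1):
        window = s[n - width:]
        reps, end = 0, n
        while end >= width and s[end - width:end] == window:
            reps += 1
            end -= width
        if reps >= 2:
            key = (end, -reps)  # shortest prefix, then most repetitions
            if best is None or key < best[0]:
                best = (key, f"{s[:end]}({window})")
    return None if best is None else best[1]


def with_suffix(s):
    """Trim from the end until a pattern is found, then re-append the trimmed part."""
    for i in range(len(s), 0, -1):
        pattern = find_trailing_pattern(s[:i])
        if pattern is not None:
            return pattern + s[i:]
    return None


print(with_suffix("AABACCACCACCXY"))   # AAB(ACC)XY
# With this stand-in, a suffix that repeats on its own hides the real pattern:
print(with_suffix("AABACCACCACCXYY"))  # AABACCACCACCX(Y)
```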


 