Find a pattern in a string without using regex
I'm trying to find a pattern in a string. Example:

    trail = 'AABACCCACCACCACCACCACC'

One can note the ACC repetition after a prefix of AAB, so the result should be AAB(ACC).

How can I do this without using regex (import re)? What I did so far:

    def get_pattern(trail):
        for j in range(0,len(trail)):
            k = j+1
            while k<len(trail) and trail[j]!=trail[k]:
                k+=1
            if k==len(trail)-1:
                continue
            window = ''
            stop = trail[j]
            m = j
            while m<len(trail) and k<len(trail) and trail[m]==trail[k]:
                window+=trail[m]
                m+=1
                k+=1
                if trail[m]==stop and len(window)>1:
                    break
            if len(window)>1:
                prefix=''
                if j>0:
                    prefix = trail[0:j]
                return prefix+'('+window+')'
        return False

This (almost) does the trick, but in a use case like AAAAAAAAAAAAAAAAAABDBDBDBDBDBDBDBDBDBDBDBDBDBDBDBD the result is AA, when it should be AAAAAAAAAAAAAAAAAA(BD).

The issue with your code is that once you find a repetition of length 2 or greater, you don't check forward to make sure it's maintained. In your second example, this causes it to latch onto the 'AA' without seeing the 'BD's that follow.

Since we know we're dealing with cases of prefix + window, it makes sense to look from the end rather than from the beginning.

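    def get_pattern(string):
        str_len = len(string)
        # For each candidate window length, cut the string into pieces from the end.
        splits = [[string[i-rep_length: i] for i in range(str_len, 0, -rep_length)]
                  for rep_length in range(1, str_len//2)]
        if not splits:
            return None  # string is too short to try any window length
        # Count how many pieces in a row equal the last one. The trailing False
        # keeps index() from raising ValueError when every piece matches.
        reps = [([window == split[0] for window in split] + [False]).index(False)
                for split in splits]
        prefix_lengths = [str_len - (i+1)*rep for i, rep in enumerate(reps)]
        # Keep only the candidates that leave the shortest prefix ...
        shortest_prefix_length = min(prefix_lengths)
        indices = [i for i, pre_len in enumerate(prefix_lengths) if pre_len == shortest_prefix_length]
        reps = list(map(reps.__getitem__, indices))
        splits = list(map(splits.__getitem__, indices))
        # ... and among those, pick the window with the most repetitions.
        max_reps = max(reps)
        window = splits[reps.index(max_reps)][0]
        prefix = string[0:shortest_prefix_length]
        return f'{prefix}({window})' if max_reps > 1 else None
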
splits uses a list comprehension to build a list of lists, where each sublist cuts the string into rep_length-sized pieces starting from the end.
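
To make the splitting step concrete, here is a minimal sketch of what one sublist looks like. The helper name split_from_end and the sample string are made up for illustration. The sketch also clamps the start index at 0, which is an assumption that differs slightly from the comprehension above: there, a negative start index makes the leftover slice come out empty instead of holding the leading characters.

```python
def split_from_end(string, rep_length):
    """Cut string into rep_length-sized pieces, last piece first (hypothetical helper)."""
    pieces = []
    for end in range(len(string), 0, -rep_length):
        # Clamp at 0 so the leftover piece keeps the leading characters
        # (assumption of this sketch, not part of the answer's code).
        start = max(end - rep_length, 0)
        pieces.append(string[start:end])
    return pieces


print(split_from_end("XYABABAB", 2))  # ['AB', 'AB', 'AB', 'XY']
print(split_from_end("XYABABAB", 3))  # ['BAB', 'ABA', 'XY']
```

With a window length of 2, the first three pieces are identical, which is the repetition the next step counts.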

For each sublist split, the first piece, split[0], is our proposed pattern, and we count how many times it is repeated. This is easily done by checking window == split[0] for every piece and using list.index() to find the first False. We also need the size of the prefix, which is whatever remains once the repeated pieces are taken off the end.

We want the shortest prefix with the largest number of reps. This is because of nasty edge cases like jeifjeiAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBBAABBBBBBBBBBBBBB, where the B inside the window repeats more often than the window itself. Additionally, anything that repeats 4 times can also be seen as a double-sized window repeated twice.
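
The selection rule can be seen on a small input. The sketch below is only an illustration: count_trailing_reps is a hypothetical helper that counts repetitions with an explicit loop instead of list.index(False), and the sample string is made up.

```python
def count_trailing_reps(string, rep_length):
    """Count back-to-back copies of the last rep_length characters at the end of string."""
    window = string[len(string) - rep_length:]
    count = 0
    end = len(string)
    # Walk backwards one window at a time while the piece still matches.
    while end - rep_length >= 0 and string[end - rep_length:end] == window:
        count += 1
        end -= rep_length
    return count


string = "XYABABABAB"
for rep_length in range(1, len(string) // 2):
    reps = count_trailing_reps(string, rep_length)
    prefix_length = len(string) - rep_length * reps
    print(rep_length, reps, prefix_length)
# 1 1 9
# 2 4 2
# 3 1 7
# 4 2 2
```

Window lengths 2 and 4 both leave a prefix of length 2 (AB repeated four times versus ABAB repeated twice), so the tie is broken by taking the larger number of reps, which selects AB.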

If you also want to handle a suffix, a hacky solution is to trim characters from the end until get_pattern() returns a pattern, and then append whatever was trimmed:

    def get_pattern_w_suffix(string):
        for i in range(len(string), 0, -1):
            pattern = get_pattern(string[0:i])
            suffix = string[i:]
            if pattern is not None:
                return pattern + suffix
        return None

However, this assumes that the suffix doesn't contain a pattern itself.
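
As a rough usage sketch, the snippet below runs the suffix version on two made-up inputs. So that it runs on its own, get_pattern is repeated in condensed form with two guards that are assumptions of this sketch: one for strings too short to try any window length, and one for strings in which every piece matches.

```python
def get_pattern(string):
    # Condensed copy of get_pattern, plus two guards (assumptions of this sketch).
    str_len = len(string)
    splits = [[string[i - n:i] for i in range(str_len, 0, -n)]
              for n in range(1, str_len // 2)]
    if not splits:  # guard: string too short to try any window length
        return None
    # The trailing False is a guard so index() cannot raise when all pieces match.
    reps = [([piece == split[0] for piece in split] + [False]).index(False)
            for split in splits]
    prefix_lengths = [str_len - (i + 1) * rep for i, rep in enumerate(reps)]
    shortest = min(prefix_lengths)
    best = [i for i, length in enumerate(prefix_lengths) if length == shortest]
    max_reps = max(reps[i] for i in best)
    window = next(splits[i][0] for i in best if reps[i] == max_reps)
    return f"{string[:shortest]}({window})" if max_reps > 1 else None


def get_pattern_w_suffix(string):
    # Trim one character at a time from the end until a pattern shows up.
    for i in range(len(string), 0, -1):
        pattern = get_pattern(string[:i])
        if pattern is not None:
            return pattern + string[i:]
    return None


print(get_pattern_w_suffix("XYABABABZ"))  # XY(AB)Z
print(get_pattern_w_suffix("ABCDEF"))     # None
```

In the first call the trailing Z is trimmed, XYABABAB is matched as XY(AB), and the Z is appended again.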