Is there a fast algorithm to remove repeated substrings in a string?
There is a string like this:
dxabcabcyyyydxycxcxz
and I want to collapse it into:
dxabcydxycxz
Other examples: ddxddx -> dxdx, abbab -> abab.
The rule is:
if (adjacent and same): merge
# For example, the two 'abc' substrings are identical and adjacent, so I delete one of them.
# The two 'dx' substrings are identical but not adjacent, so I delete neither of them.
# Once a character has been deleted, no substring that includes it is deleted.
I implemented this in Python, but it is slow on long strings.
# original string
mystr = "dxabcabcyyyydxycxcxz"
str_len = len(mystr)
vis = [1] * str_len  # marks each character as kept (1) or deleted (0)
# enumerate the size of the substring
for i in range(1, str_len):
    # enumerate the start of the substring
    for j in range(0, str_len):
        offset = 2  # index of the repetition being checked (2 = first repeat)
        current_sub_str = mystr[j:j+i]
        s_begin = j + i * (offset - 1)
        s_end = j + i * offset
        # delete every adjacent repetition of the substring
        while (s_end <= str_len and current_sub_str == mystr[s_begin:s_end]
               and 0 not in vis[s_begin:s_end] and 0 not in vis[j:j+i]):
            vis[s_begin:s_end] = [0] * (s_end - s_begin)  # mark deleted characters as 0
            offset += 1
            s_begin = j + i * (offset - 1)
            s_end = j + i * offset
res = []
for i in range(0, str_len):
    if vis[i] != 0:
        res.append(mystr[i])
print("".join(res))
Is there a faster way to solve it?
Update (April 29, 2017):
Sorry, this looks like an XY problem. On the other hand, maybe it is not. Here is the background.
I was writing a web spider and collected many 'tag-paths' like these:
ul/li/a
ul/li/div/div/div/a/span
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
As you can see, some of the 'tag-paths' repeat, so I wanted to collapse them to check whether other 'tag-path' sequences have the same structure. After collapsing, I get 'tag-paths' like this:
ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
This was only my idea, and I did not know whether it was a suitable approach. (After trying it, I chose another way to do it.)
However, it is an interesting question, much like an ACM contest problem.
So I reduced each 'tag-path' to a single character and asked for help, because I could not find a fast way myself. Actually, the question has many corner cases that I don't mind, and I thank everyone for helping me complete it.
Thanks all.
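The simplification described above (one 'tag-path' reduced to one character) can be sketched as follows. This is only an illustration under stated assumptions: the function name collapse_paths is hypothetical, each distinct path is mapped to a Unicode private-use code point that is assumed not to occur in the data, and the collapse step uses a single pass of a backreference regex, which is just one possible reading of the rule. Because that regex prefers the shortest repeating unit at each position and runs only once, it is not guaranteed to reproduce the collapsed listing above for every input.

```python
import re

def collapse_paths(paths):
    """Collapse adjacent repeated runs in a list of tag-path strings."""
    # Assumption: map each distinct path to one private-use character.
    codes = {}
    for path in paths:
        if path not in codes:
            codes[path] = chr(0xE000 + len(codes))
    decode = {code: path for path, code in codes.items()}

    # Encode the list as a string, collapse repeats, then decode again.
    encoded = "".join(codes[path] for path in paths)
    collapsed = re.sub(r"(.+?)\1+", r"\1", encoded)
    return [decode[ch] for ch in collapsed]

paths = [
    "ul/li/a",
    "ul/li/div/div/div/a/span",
    "ul/li/div/div/div/a/span",
    "ul/li/a",
    "ul/li/ul/li/a",
    "ul/li/ul/li/a",
]
print(collapse_paths(paths))
# ['ul/li/a', 'ul/li/div/div/div/a/span', 'ul/li/a', 'ul/li/ul/li/a']
```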
Behold the power of regex:
>>> import re
>>> re.sub(r"(.+?)\1+", r"\1", "dxabcabcyyyydxycxcxz")
'dxabcydxycxz'
>>> re.sub(r"(.+?)\1+", r"\1", "ddxddx")
'dxdx'
>>> re.sub(r"(.+?)\1+", r"\1", "abbab")
'abab'
This looks for a sequence of one or more arbitrary characters, (.+?) (a non-greedy match, so that shorter sequences are tried first), followed by one or more repetitions of the matched sequence, \1+, and replaces the whole match with a single copy of the matched sequence, \1.
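For reuse, the pattern can be compiled once and wrapped in a small function. The sketch below uses the same pattern as above; the name collapse_repeats is a hypothetical helper, not part of the re module. Two caveats are worth keeping in mind: by default . does not match a newline, so a repeating unit never spans lines, and backreference matching relies on backtracking, so this may still be slow on very long inputs.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
REPEATED_UNIT = re.compile(r"(.+?)\1+")

def collapse_repeats(text):
    """Replace each run of adjacent repeats of a unit with one copy of it."""
    return REPEATED_UNIT.sub(r"\1", text)

for sample in ("dxabcabcyyyydxycxcxz", "ddxddx", "abbab"):
    print(sample, "->", collapse_repeats(sample))
```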
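This can be a start:
# `string` holds the input text; it is shortened in place as repeats are removed
for i in range(len(string)):
    for j in range(i + 1, len(string)):
        # while string[i:j] is immediately followed by a copy of itself, drop the copy
        while string[i:j] == string[j:j + j - i]:
            string = string[:j] + string[j + j - i:]
The results on the examples provided: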
abbab -> abab
ddxddx -> dxdx
abcabcabc -> abc
dxabcabcyyyydxycxcxz -> dxabcydxycxz
This is a great question/series of responses!
Here's an implementation using a generator and string slicing:
import math

def dedupe(string, step=1):
    index = 0
    prior = ''
    while index < len(string):
        letter = string[index]
        window = index + step
        comparison = string[index:window]
        if comparison != prior:
            yield letter
            prior += letter
            index += 1
        else:
            index += step
        if len(prior) > step:
            prior = prior[1:]  # remove first character

def collapse(string):
    step = 1
    while step < math.sqrt(len(string)):
        generator = dedupe(string, step=step)
        string = ''.join(generator)
        step += 1
    return string
Edit: changed the step search to use the square root of the length to improve search times:
%timeit collapse('dxabcabcyyyydxycxcxz')
10000 loops, best of 3: 24.7 µs per loop
%timeit collapse(randomword(100))
1000 loops, best of 3: 384 µs per loop
%timeit collapse("a" * 100)
10000 loops, best of 3: 27.1 µs per loop
%timeit collapse(randomword(50) * 2)
1000 loops, best of 3: 382 µs per loop
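The timings call randomword, which is not defined in the answer. To rerun them, something like the following would be needed. This is a guess at its behaviour (a random string of lowercase ASCII letters of the given length), not the answerer's actual helper.

```python
import random
import string

def randomword(length):
    """Hypothetical stand-in: a random string of lowercase ASCII letters."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

word = randomword(100)
print(len(word))  # 100
```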
One line:
def remove_repeats(iterable):
    return [e for (i, e) in enumerate(iterable) if i == 0 or e != iterable[i - 1]]
It works with any indexable sequence (string, list, tuple) and returns a list.
>>> print remove_repeats("aaabbc")
['a', 'b', 'c']
>>> s = '''
... ul/li/a
... ul/li/div/div/div/a/span
... ul/li/div/div/div/a/span
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... '''
>>> print remove_repeats(s.split())
['ul/li/a', 'ul/li/div/div/div/a/span', 'ul/li/a', 'ul/li/ul/li/a', 'ul/li/a', 'ul/li/ul/li/a', 'ul/li/a', 'ul/li/ul/li/a']
Join if you need a string:
>>> print "".join(remove_repeats('111222333'))
123
>>> print "\n".join(remove_repeats(s.split()))
ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
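Because remove_repeats indexes iterable[i - 1], it needs a sequence and will not accept a generator or a file object. A generator-based variant that remembers only the previous item could look like the sketch below. The name iter_remove_repeats is hypothetical, and like the original it only drops items equal to their immediate predecessor.

```python
def iter_remove_repeats(iterable):
    """Yield items, skipping any item equal to the one just before it."""
    sentinel = object()  # unique marker meaning "no previous item yet"
    previous = sentinel
    for item in iterable:
        if previous is sentinel or item != previous:
            yield item
        previous = item

# Works on one-shot iterators as well as sequences.
print(list(iter_remove_repeats(iter("aaabbc"))))  # ['a', 'b', 'c']
```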
mystr = "dxabcabcyyyydxycxcxz"
while True:
    # count maps a start index to the length of the longest substring
    # starting there that is immediately followed by a copy of itself
    count = {}
    for index in range(len(mystr)):
        for index1 in range(index, -1, -1):
            length = index + 1 - index1
            if mystr.startswith(mystr[index1:index + 1], index + 1):
                if count.get(index1, 0) < length:
                    count[index1] = length
    if not count:
        break
    # remove only the leftmost repeat per pass, because deleting several
    # at once would shift the other recorded indices
    start = min(count)
    mystr = mystr[:start] + mystr[start + count[start]:]
print("Answer: " + mystr)
One linear approach (it collapses only runs of a single repeated character):
import itertools
_str = 'dxabcabcyyyydxycxcxz'
print(''.join(ch for ch, _ in itertools.groupby(_str)))
Result:
dxabcabcyyyydxycxcxz -> dxabcabcydxycxcxz