Is there a fast algorithm to remove repeated substrings in a string?

I have a string like this:

dxabcabcyyyydxycxcxz

and I want to merge it into:

dxabcydxycxz

Other examples: ddxddx -> dxdx, abbab -> abab.

The rule is:

if (adjacent and same): merge

# For example, the two 'abc' substrings are identical and adjacent, so I delete one of them.
# Although 'dx' appears twice, the two occurrences are not adjacent, so I delete neither.
# Once a character has been deleted, no substring that includes it is deleted.

I implemented this in Python, but it is slow on long strings.

# original string
mystr = "dxabcabcyyyydxycxcxz"
str_len = len(mystr)
vis = [1] * str_len  # marks each character as kept (1) or deleted (0)

# enumerate the length of the substring
for i in range(1, str_len):
    # enumerate the start of the substring
    for j in range(0, str_len):
        offset = 2  # which copy is being tested (2 = the copy right after the original)
        current_sub_str = mystr[j:j+i]
        s_begin = j + i * (offset - 1)
        s_end = j + i * offset
        # delete every adjacent copy of the substring
        while (s_end <= str_len and current_sub_str == mystr[s_begin:s_end]
               and 0 not in vis[s_begin:s_end] and 0 not in vis[j:j+i]):
            vis[s_begin:s_end] = [0] * (s_end - s_begin)  # mark deleted characters as 0
            offset += 1
            s_begin = j + i * (offset - 1)
            s_end = j + i * offset

res = []
for i in range(0, str_len):
    if vis[i] != 0:
        res.append(mystr[i])

print("".join(res))

Is there a faster way to solve this?

Update, April 29, 2017

Sorry, this looks like an XY problem. On the other hand, maybe it is not. Here is the context.

I was writing a web spider and got many 'tag-path's like these:

ul/li/a
ul/li/div/div/div/a/span
ul/li/div/div/div/a/span 
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a
ul/li/ul/li/a

As you can see, some 'tag-path's repeat consecutively, so I wanted to collapse them to check whether any other 'tag-path's have the same structure. After collapsing, I get 'tag-path's like this:

ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a

This was only my idea, and I did not know whether it was a suitable approach. (After trying it, I chose another way to do it.)

However, it is an interesting question, similar to an ACM problem.

So I simplified each 'tag-path' to a single character and asked for help, because I could not find a fast way myself. The question has many corner cases that I don't mind. Thanks to everyone for helping me complete it.

Thanks all.

Behold the power of regex:

>>> import re

>>> re.sub(r"(.+?)\1+", r"\1", "dxabcabcyyyydxycxcxz")
'dxabcydxycxz'

>>> re.sub(r"(.+?)\1+", r"\1", "ddxddx")
'dxdx'

>>> re.sub(r"(.+?)\1+", r"\1", "abbab")
'abab'

This looks for a sequence of 1 or more arbitrary characters (.+?) (as a non-greedy match, so that it tries shorter sequences first), followed by 1 or more repetitions of the matched sequence \1+, and replaces it all with just the matched sequence \1.
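
For reuse, the substitution above can be wrapped in a small function. This is only a sketch: the name `collapse_repeats` is hypothetical, and the `re.DOTALL` flag is an assumption added here so that `.` also matches newlines (the original pattern does not use it, so repeats that span line breaks would otherwise be left alone).

```python
import re

# Same pattern as in the answer above. re.DOTALL is an added assumption:
# it lets "." match newline characters as well.
_REPEAT = re.compile(r"(.+?)\1+", re.DOTALL)


def collapse_repeats(text):
    """Replace each run of adjacent repeated substrings with one copy.

    This makes a single left-to-right pass; it does not re-scan the result.
    """
    return _REPEAT.sub(r"\1", text)


if __name__ == "__main__":
    for sample in ("dxabcabcyyyydxycxcxz", "ddxddx", "abbab"):
        print(sample, "->", collapse_repeats(sample))
```

Compiling the pattern once avoids re-parsing it on every call, which may matter when the function runs over many strings.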

This can be a start:

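string = "dxabcabcyyyydxycxcxz"
for i in range(len(string)):
    for j in range(i + 1, len(string)):
        # while string[i:j] is immediately followed by a copy of itself, drop the copy
        while string[i:j] == string[j:j + j - i]:
            string = string[:j] + string[j + j - i:]
print(string)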

The results on the examples provided:

abbab  -> abab
ddxddx -> dxdx
abcabcabc -> abc
dxabcabcyyyydxycxcxz -> dxabcydxycxz

This is a great question/series of responses!

Here's an implementation using a generator and string slicing:

import math
def dedupe(string, step=1):
    index = 0
    prior = ''
    while index < len(string):
        letter = string[index]
        window = index + step
        comparison = string[index:window]
        if comparison != prior:
            yield letter
            prior += letter
            index += 1
        else:
            index += step
        if len(prior) > (step):
            prior = prior[1:] # remove first character


def collapse(string):
    step = 1
    while step < math.sqrt(len(string)):
        generator = dedupe(string, step=step)
        string = ''.join(generator)
        step +=1
    return string

Edit: changed the step search to stop at the square root of the string length to improve search times (so only repeated blocks shorter than that bound are tried):

  • %timeit collapse('dxabcabcyyyydxycxcxz'): 10000 loops, best of 3: 24.7 µs per loop
  • %timeit collapse(randomword(100)): 1000 loops, best of 3: 384 µs per loop
  • %timeit collapse("a" * 100): 10000 loops, best of 3: 27.1 µs per loop
  • %timeit collapse(randomword(50) * 2): 1000 loops, best of 3: 382 µs per loop
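
The timings above call `randomword`, which the answer does not define. Below is a minimal sketch for reproducing this kind of measurement outside IPython. It assumes `randomword` returns random lowercase letters (a guess), and it times the regex from the earlier answer as a stand-in; `regex_collapse` and `time_per_call` are hypothetical names, and any other collapse function can be passed in instead.

```python
import random
import re
import string
import timeit


def randomword(length, seed=None):
    """Hypothetical stand-in for the undefined helper: random lowercase letters."""
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def regex_collapse(text):
    # Stand-in for the function being timed; swap in any other implementation.
    return re.sub(r"(.+?)\1+", r"\1", text)


def time_per_call(func, text, number=200):
    """Return the average number of seconds per call of func(text)."""
    return timeit.timeit(lambda: func(text), number=number) / number


if __name__ == "__main__":
    cases = [
        ("random 100", randomword(100, seed=0)),
        ("'a' * 100", "a" * 100),
        ("random 50 * 2", randomword(50, seed=0) * 2),
    ]
    for label, text in cases:
        print(label, time_per_call(regex_collapse, text))
```

The numbers depend on the machine and Python version, so they are not directly comparable with the figures quoted above.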

A one-liner:

def remove_repeats(iterable):
    return [e for (i, e) in enumerate(iterable) if i == 0 or e != iterable[i - 1]]

It works with any indexable sequence (string, list, tuple) and returns a list.

>>> print(remove_repeats("aaabbc"))
['a', 'b', 'c']

>>> s = '''
... ul/li/a
... ul/li/div/div/div/a/span
... ul/li/div/div/div/a/span
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... ul/li/ul/li/a
... '''

>>> print(remove_repeats(s.split()))
['ul/li/a', 'ul/li/div/div/div/a/span', 'ul/li/a', 'ul/li/ul/li/a', 'ul/li/a', 'ul/li/ul/li/a', 'ul/li/a', 'ul/li/ul/li/a']

Join if you need a string:

>>> print("".join(remove_repeats('111222333')))
123

>>> print("\n".join(remove_repeats(s.split())))
ul/li/a
ul/li/div/div/div/a/span
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
ul/li/a
ul/li/ul/li/a
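
The list comprehension above looks up `iterable[i - 1]`, so it needs an object that supports indexing. For inputs that can only be iterated once (a generator, or a file read line by line), a generator that remembers the previous item is one option. This is a minimal sketch; the name `remove_adjacent_repeats` is hypothetical.

```python
def remove_adjacent_repeats(iterable):
    """Yield items in order, skipping any item equal to the one just before it."""
    sentinel = object()  # marker that cannot be equal to any real item
    previous = sentinel
    for item in iterable:
        if previous is sentinel or item != previous:
            yield item
        previous = item


if __name__ == "__main__":
    print(list(remove_adjacent_repeats("aaabbc")))
    paths = (p for p in ["ul/li/a", "ul/li/a", "ul/li/ul/li/a", "ul/li/a"])
    print("\n".join(remove_adjacent_repeats(paths)))
```

Like the one-liner, this only collapses runs of a single repeated item; it does not detect longer repeated blocks such as 'abcabc'.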

from collections import OrderedDict
mystr = "dxabcabcyyyydxycxcxz"
index=0;indexs = [];count = OrderedDict()
while count!=None:
    count = {}
    for index in range(0,len(mystr)):
        flag = True
        for index1 in range(0,index+1)[::-1]:
            if(mystr.startswith(mystr[index1:index+1], index+1)):
                if count.get(str(index1),0)<(index+1-index1):
                    count.update({str(index1) : index+1-index1})
    for key in count:
        mystr = mystr[:int(key)]+mystr[int(key)+count[key]:]
    if count=={}:
        count=None
print("Answer: " + mystr)

A linear-time approach (it collapses only runs of a single repeated character):

import itertools
_str = 'dxabcabcyyyydxycxcxz'
print(''.join(ch for ch, _ in itertools.groupby(_str)))

Result:

dxabcabcyyyydxycxcxz -> dxabcabcydxycxcxz
