Find the length of the longest common substring in 'n' binary strings

I am given n strings (2 <= n <= 4), each constructed using only two letters: a and b. In this set of strings I have to find the length of the longest common substring that is present in all the strings. A solution is guaranteed to exist. Let's see an example:

n=4
abbabaaaaabb
aaaababab
bbbbaaaab
aaaaaaabaaab

The result is 5 (because the longest common substring is "aaaab").
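
For reference, the expected value can be checked with a brute-force sketch in Python. This is not the approach described below; it just intersects the sets of substrings of each length, from the largest candidate length down. The function name and the max_len parameter are made up for illustration, and max_len=60 assumes the bound stated in the problem.

```python
def longest_common_substring_length(strings, max_len=60):
    """Brute force: try each length from the largest candidate down to 1."""
    shortest = min(len(s) for s in strings)
    for length in range(min(shortest, max_len), 0, -1):
        common = None
        for s in strings:
            # all substrings of s that have exactly this length
            subs = {s[i:i + length] for i in range(len(s) - length + 1)}
            common = subs if common is None else common & subs
            if not common:
                break  # nothing of this length is shared by all strings
        if common:
            return length
    return 0

example = ["abbabaaaaabb", "aaaababab", "bbbbaaaab", "aaaaaaabaaab"]
print(longest_common_substring_length(example))  # 5
```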

I don't have to print (or even know) the substring; I just need to print its length.

It is also given that the result cannot be greater than 60, even though the length of each string can be as high as 13 000.

What I tried is this: I found the length of the shortest of the given strings, compared it with 60, and chose the smaller of the two values as starting_point. Then I took substrings of the first string of length len, where len goes from starting_point down to 1. At each iteration I take every substring of the first string of length len and use it as pattern. Using the KMP algorithm (thus, complexity of O(n+m) per search), I iterate through all the other strings (from 2 to n) and check whether pattern is found in string i. Whenever it isn't found, I break out of the iteration and try the next substring of length len or, if there isn't any, I decrease len and try all substrings of the new, smaller length. But if it matches in every string, I stop the program and print len: since we start from the longest possible length and decrease it at each step, the first match we find must have the largest possible length.

Here is the code (it doesn't really matter, since this method is not good enough; I know I shouldn't use using namespace std, but it doesn't really affect this program so I didn't bother):

#include <iostream>
#include <string>
#define nmax 50001
#define result_max 60

using namespace std;

int n,m,lps[nmax],starting_point,len;
string a[nmax],pattern,str;

void create_lps() {
    lps[0]=0;
    unsigned int len=0,i=1;
    while (i < pattern.length()) {
        if (pattern[i] == pattern[len]) {
            len++;
            lps[i] = len;
            i++;
        }
        else {
            if (len != 0) {
                len = lps[len-1];
            }
            else {
                lps[i] = 0;
                i++;
            }
        }
    }
}

bool kmp_MatchOrNot(int index) {
    unsigned int i=0,j=0;
    while (i < a[index].length()) {
        if (pattern[j] == a[index][i]) {
            j++;
            i++;
        }
        if (j == pattern.length()) {
            return true;
        }
        else if (i<a[index].length() && pattern[j]!=a[index][i]){
            if (j != 0) {
                j = lps[j-1];
            }
            else {
                i++;
            }
        }
    }
    return false;
}

int main()
{
    int i,left,n;
    unsigned int minim = nmax;
    bool solution;
    cin>>n;
    for (i=1;i<=n;i++) {
        cin>>a[i];
        if (a[i].length() < minim) {
            minim = a[i].length();
        }
    }

    if (minim < result_max) starting_point = minim;
    else starting_point = result_max;

    for (len=starting_point; len>=1; len--) {
        for (left=0; (unsigned)left<=a[1].length()-len; left++) {
            pattern = a[1].substr(left,len);
            solution = true;
            for (i=2;i<=n;i++) {
                if (pattern.length() > a[i].length()) {
                    solution = false;
                    break;
                }
                else {
                    create_lps();
                    if (kmp_MatchOrNot(i) == false) {
                        solution = false;
                        break;
                    }
                }
            }
            if (solution == true) {
                cout<<len;
                return 0;
            }
        }
    }
    return 0;
}

The thing is this: the program works correctly and gives the right results, but when I submitted the code on the website, it got a 'Time limit exceeded' error, so I only got half the points.

This leads me to believe that, in order to solve the problem with a better time complexity, I have to take advantage of the fact that the letters of the strings can only be a or b, since that looks like a pretty big thing that I didn't use. But I have no idea how exactly I could use this information. I would appreciate any help.

The answer is to build the suffix trees of all of the strings individually, then intersect them. A suffix tree is like a trie that contains all suffixes of one string simultaneously.

Building a suffix tree for a fixed alphabet is O(n) with Ukkonen's algorithm, where n is the length of the string. (If you don't like that explanation, you can use Google to find others.) If you have m trees of size n, this takes time O(nm).

Intersecting suffix trees is a question of traversing them in parallel, only going further when you can go further in all trees. If you have m trees of size n, this operation can be done in time no more than O(nm).

The overall time of this algorithm is O(nm). Given that just reading the strings takes time O(nm), you can't do better than that.


Adding a small amount of detail, suppose that your suffix tree is written with one character per node. Each node is then just a dictionary whose keys are characters and whose values are the rest of the tree; the key '' with the value None marks a point where a suffix ends. So, in the style of the diagram at https://imgur.com/a/tnVlSI1, the string ABAB would turn into a data structure something like this one:

{
    'A': {
        'B': {
            '': None,
            'A': {
                'B': {
                    '': None
                }
            }
        }
    },
    'B': {
        '': None,
        'A': {
            'B': {
                '': None
            }
        }
    }
}

And likewise BABA would turn into:

{
    'A': {
        '': None,
        'B': {
            'A': {
                '': None
            }
        }
    },
    'B': {
        'A': {
            '': None,
            'B': {
                'A': {
                    '': None
                }
            }
        }
    }
}
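
Writing such literals by hand gets tedious. Below is a minimal sketch of a builder that produces the same nested dictionaries, assuming the convention used above (one character per node, with the key '' mapped to None where a suffix ends). The name build_suffix_trie and the max_depth parameter are made up for illustration, and this is the naive quadratic construction, not Ukkonen's algorithm.

```python
def build_suffix_trie(s, max_depth=None):
    """Naive suffix trie as nested dicts, one character per node.

    Hypothetical helper: if max_depth is given, every suffix is cut off
    after that many characters (enough when the answer is bounded).
    """
    root = {}
    for start in range(len(s)):
        end = len(s) if max_depth is None else min(len(s), start + max_depth)
        node = root
        for ch in s[start:end]:
            node = node.setdefault(ch, {})
        if end == len(s):
            node[''] = None  # the whole suffix was inserted: mark its end
    return root

abab = build_suffix_trie("ABAB")
baba = build_suffix_trie("BABA")
```

With max_depth=60, the bound given in the question, a string of length n contributes at most 60n nodes, so the quadratic blow-up of the naive construction is avoided.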

With data structures that look like this, naive Python to compare them looks like:

def tree_intersection_depth(trees):
    best_depth = 0
    for (char, deeper) in trees[0].items():
        if deeper is None:
            continue
        failed = False

        deepers = [deeper]
        for tree in trees[1:]:
            if char in tree:
                deepers.append(tree[char])
            else:
                failed = True
                break

        if failed:
            continue

        depth = 1 + tree_intersection_depth(deepers)
        if best_depth < depth:
            best_depth = depth

    return best_depth

And you would call it like tree_intersection_depth([tree1, tree2, tree3, ...]).

With the above two trees it does indeed give 3 as the answer.

Now I actually cheated in writing out that data structure. What makes suffix trees efficient is that you DON'T actually have a data structure that looks like that. You have one that reuses all of the repeated structures. So the code to simulate setting up the data structures and calling it looks like this:

b_ = {'B': {'': None}}
ab_ = {'': None, 'A': b_}
bab_ = {'B': ab_}
abab = {'A': bab_, 'B': ab_}

a_ = {'A': {'': None}}
ba_ = {'': None, 'B': a_}
aba_ = {'A': ba_}
baba = {'B': aba_, 'A': ba_}

print(tree_intersection_depth([abab, baba]))

And now we can see that to get the promised performance, there is a missing step. The problem is that although the size of the tree is O(n), while searching it we will potentially visit O(n^2) substrings. In your case you don't have to worry about that, because the substrings are guaranteed never to go deeper than 60. But in the fully general case you would need to add memoization, so that when the recursion ends up comparing a combination of data structures that you have seen before, you immediately return the stored answer instead of recomputing it. (In Python you would use the id() function to compare the identity of each object with ones you've seen before. In C++, keep a set of pointer tuples for the same purpose.)
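
A minimal sketch of that memoization in Python, assuming the same nested-dictionary trees as above. The function name tree_intersection_depth_memo and the cache parameter are made up for illustration; the cache key is the tuple of id() values of the nodes currently being compared.

```python
def tree_intersection_depth_memo(trees, cache=None):
    """Same traversal as tree_intersection_depth, with memoization.

    The cache key is the tuple of object identities of the current
    nodes, so a combination of shared subtrees is explored only once.
    """
    if cache is None:
        cache = {}
    key = tuple(id(tree) for tree in trees)
    if key in cache:
        return cache[key]

    best_depth = 0
    for char, deeper in trees[0].items():
        if deeper is None:  # end-of-suffix marker, not a character
            continue
        deepers = [deeper]
        for tree in trees[1:]:
            if char not in tree:
                break
            deepers.append(tree[char])
        else:
            # every tree has this character: go one level deeper
            depth = 1 + tree_intersection_depth_memo(deepers, cache)
            best_depth = max(best_depth, depth)

    cache[key] = best_depth
    return best_depth

# The shared structures for ABAB and BABA from the answer
b_ = {'B': {'': None}}
ab_ = {'': None, 'A': b_}
bab_ = {'B': ab_}
abab = {'A': bab_, 'B': ab_}

a_ = {'A': {'': None}}
ba_ = {'': None, 'B': a_}
aba_ = {'A': ba_}
baba = {'B': aba_, 'A': ba_}

print(tree_intersection_depth_memo([abab, baba]))  # 3
```

Keying on id() is safe here only because every node stays alive for the whole call; for very deep trees the recursion would also need to be made iterative or the recursion limit raised.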
