
How to optimise the O(m.n) solution for longest common subsequence?

Given two strings, X of length x1 and Y of length y1, find the longest sequence of characters that appears left to right (but not necessarily in a contiguous block) in both strings.

E.g. if X = ABCBDAB and Y = BDCABA, then LCS(X,Y) = {"BCBA","BDAB","BCAB"} and the LCS length is 4.

I used the standard solution for this problem:

if(X[i]==Y[j]) :1+LCS(i+1,j+1)
if(X[i]!=Y[j]) :LCS(i,j+1) or LCS(i+1,j), whichever is greater

and then I used memoization, making it a standard DP problem.

    #include<iostream>
    #include<string>
    using namespace std;
    int LCS[1024][1024];

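    // Fills LCS[i][j] with the LCS length of the suffixes x[i..] and y[j..].
    int LCSlen(string &x, int x1, string &y, int y1){

        // Base cases: an empty suffix has an LCS of length 0.
        for(int i = 0; i <= x1; i++)
            LCS[i][y1] = 0;

        for(int j = 0; j <= y1; j++)
            LCS[x1][j] = 0;

        for(int i = x1 - 1; i >= 0; i--){

            for(int j = y1 - 1; j >= 0; j--){

                // Start from the diagonal and extend it on a match.
                LCS[i][j] = LCS[i+1][j+1];

                if(x[i] == y[j])
                    LCS[i][j]++;

                // Otherwise take the better of skipping y[j] or x[i].
                if(LCS[i][j+1] > LCS[i][j])
                    LCS[i][j] = LCS[i][j+1];

                if(LCS[i+1][j] > LCS[i][j])
                    LCS[i][j] = LCS[i+1][j];

            }
        }

        return LCS[0][0];
    }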

    int main()
    {
        string x;
        string y;
        cin >> x >> y;
        int x1 = x.length() , y1 = y.length();
        int ans = LCSlen( x, x1, y, y1);
        cout << ans << endl;
        return 0;
    }

I submitted this solution on SPOJ and got a "time limit exceeded" and/or a runtime error.

Only 14 user solutions have been accepted so far. Is there a smarter trick to decrease the time complexity of this problem?

LCS is a classical, well-studied computer science problem, and for two general sequences no algorithm is known that improves on the O(n·m) worst case by more than polylogarithmic factors.

Furthermore, your implementation has no obvious efficiency bugs, so it should run close to as fast as this approach allows. It may still be beneficial to use a dynamically sized 2D matrix rather than the oversized one, which takes up 4 MiB of memory and causes frequent cache misses. These are costly, since each one triggers a transfer from main memory to the processor cache, which is much slower than a cached memory access.
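As an illustration of the memory point, here is a minimal sketch that goes one step further than a dynamically sized matrix and keeps only two rows, sized to the shorter input. The function name `lcs_length_two_rows` is made up for this example and is not part of the question's code. It only returns the length, and the running time is still O(n·m).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): LCS length using two rolling
// rows, so memory is O(min(m, n)) instead of a fixed 1024 x 1024 table.
int lcs_length_two_rows(const std::string& a, const std::string& b) {
    // Let `shorter` index the columns so the rows are as small as possible.
    const std::string& shorter = a.size() <= b.size() ? a : b;
    const std::string& longer  = a.size() <= b.size() ? b : a;

    std::vector<int> prev(shorter.size() + 1, 0);
    std::vector<int> curr(shorter.size() + 1, 0);

    for (std::size_t i = 1; i <= longer.size(); ++i) {
        for (std::size_t j = 1; j <= shorter.size(); ++j) {
            if (longer[i - 1] == shorter[j - 1])
                curr[j] = prev[j - 1] + 1;                 // extend the diagonal
            else
                curr[j] = std::max(prev[j], curr[j - 1]);  // skip one character
        }
        std::swap(prev, curr);  // the finished row becomes the previous row
    }
    return prev[shorter.size()];
}
```

Calling `lcs_length_two_rows("ABCBDAB", "BDCABA")` should give 4, matching the example in the question. Whether this is enough to pass the time limit on SPOJ is not something the sketch can promise.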

In terms of the algorithm, in order to lower the theoretical bound you need to exploit specifics of your input structure: for instance, if you are searching one of the strings repeatedly, it may pay to build a search index, which takes some processing time up front but makes the actual search much faster. Two classical variants of that are the suffix array and the suffix tree.

If it is known that at least one of your strings is very short (< 64 characters) you can use Myers' bit vector algorithm, which performs much faster. Unfortunately the algorithm is far from trivial to implement. There exists an implementation in the SeqAn library, but using the library itself has a steep learning curve.

(As a matter of interest, this algorithm finds frequent application in bioinformatics, and was used during sequence assembly in the Human Genome Project.)
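To give a feel for the short-string case, here is a minimal sketch of a bit-parallel LCS length computation. Note that this is not Myers' algorithm itself and not the SeqAn implementation: it is the simpler LCS-length recurrence in the style described by Crochemore et al. and Hyyrö, and it assumes the shorter string fits into one 64-bit word. The function name `lcs_length_bitparallel` and the `-1` error convention are assumptions made for this example.

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Hypothetical helper: bit-parallel LCS length, assuming shortStr has at most
// 64 characters. Returns -1 if that assumption is violated.
int lcs_length_bitparallel(const std::string& shortStr,
                           const std::string& longStr) {
    const std::size_t m = shortStr.size();
    if (m > 64) return -1;
    if (m == 0) return 0;

    // match[c] has bit i set exactly when shortStr[i] == c.
    std::uint64_t match[256] = {0};
    for (std::size_t i = 0; i < m; ++i)
        match[static_cast<unsigned char>(shortStr[i])] |= std::uint64_t{1} << i;

    // Each zero bit in v marks a position where the DP column increases by 1.
    std::uint64_t v = ~std::uint64_t{0};
    for (char ch : longStr) {
        const std::uint64_t u = v & match[static_cast<unsigned char>(ch)];
        v = (v + u) | (v & ~u);  // one whole DP column per iteration
    }

    // Count the zero bits among the low m bits of v.
    const std::uint64_t mask =
        (m == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << m) - 1);
    return static_cast<int>(std::bitset<64>(~v & mask).count());
}
```

Each character of the long string costs a constant number of word operations, so the running time is proportional to the length of the long string plus the table setup. For X = ABCBDAB and Y = BDCABA, `lcs_length_bitparallel(X, Y)` should return 4. Longer "short" strings would need several words with carry propagation between them, which this sketch leaves out.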

Although I still didn't get an AC because the time limit was exceeded, I was able to implement the linear-space algorithm. In case anyone wants to see it, here is the C++ implementation of the Hirschberg algorithm.

#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <cstring>
#include <string>
#include <cstdio>
using namespace std;

int* compute_help_table(const string & A,const string & B);
string lcs(const string & A, const string & B);
string simple_solution(const string & A, const string & B);

int main(void) {
    string A,B;
    cin>>A>>B;

    cout << lcs(A, B).size() << endl;

    return 0;
}

string lcs(const string &A, const string &B) {
    int m = A.size();
    int n = B.size();

    if (m == 0 || n == 0) {
        return "";
    }
    else if(m == 1) {
        return simple_solution(A, B);
    }
    else if(n == 1) {
        return simple_solution(B, A);
    }
    else {
        int i = m / 2;

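        // Second half of A, reversed: together with the reversed B this makes
        // L2[n-j] the LCS length of A[i..] and B[j..].
        string Asubstr = A.substr(i, m - i);
        reverse(Asubstr.begin(), Asubstr.end());
        string Brev = B;
        reverse(Brev.begin(), Brev.end());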

        int* L1 = compute_help_table(A.substr(0, i), B);
        int* L2 = compute_help_table(Asubstr, Brev);

        int k;
        int M = -1;
        for(int j = 0; j <= n; j++) {
            if(M < L1[j] + L2[n-j]) {
                M = L1[j] + L2[n-j];
                k = j;
            }
        }

        delete [] L1;
        delete [] L2;

        return lcs(A.substr(0, i), B.substr(0, k)) + lcs(A.substr(i, m - i), B.substr(k, n - k));
    }
}

int* compute_help_table(const string &A, const string &B) {
    int m = A.size();
    int n = B.size();

    int* first = new int[n+1];
    int* second = new int[n+1];

    for(int i = 0; i <= n; i++) {
        second[i] = 0;
    }

    for(int i = 0; i < m; i++) {
        for(int k = 0; k <= n; k++) {
            first[k] = second[k];  
        }

        for(int j = 0; j < n; j++) {
            if(j == 0) {
                if (A[i] == B[j])
                    second[1] = 1;
            }
            else {
                if(A[i] == B[j]) {
                    second[j+1] = first[j] + 1;
                }
                else {
                    second[j+1] = max(second[j], first[j+1]);
                }
            }
        }
    }

    delete [] first;
    return second;
}

string simple_solution(const string & A, const string & B) {
    int i = 0;
    for(; i < B.size(); i++) {
        if(B.at(i) == A.at(0))
            return A;
    }

    return "";
}


If the two strings share a common prefix (e.g. "ABCD" and "ABXY" share "AB") then that will be part of the LCS. The same goes for common suffixes. So for some pairs of strings you can gain some speed by skipping over the longest common prefix and longest common suffix before starting the DP algorithm; this doesn't change the worst-case bounds, but it changes the best-case complexity to linear time and constant space.
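The prefix/suffix trick can be sketched as follows. The names `Affixes`, `common_affixes` and `lcs_length_trimmed` are invented for this example, and the DP in the middle is a plain two-row version used only to keep the sketch self-contained; any LCS routine could be substituted there.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper types and names, for illustration only.
struct Affixes {
    std::size_t prefix;  // length of the longest common prefix
    std::size_t suffix;  // length of the longest common suffix after the prefix
};

Affixes common_affixes(const std::string& a, const std::string& b) {
    const std::size_t limit = std::min(a.size(), b.size());
    std::size_t p = 0;
    while (p < limit && a[p] == b[p]) ++p;
    // The suffix must not overlap the prefix in either string.
    std::size_t s = 0;
    while (s < limit - p && a[a.size() - 1 - s] == b[b.size() - 1 - s]) ++s;
    return {p, s};
}

int lcs_length_trimmed(const std::string& a, const std::string& b) {
    const Affixes t = common_affixes(a, b);
    // Only the middle parts still need the quadratic DP.
    const std::string x = a.substr(t.prefix, a.size() - t.prefix - t.suffix);
    const std::string y = b.substr(t.prefix, b.size() - t.prefix - t.suffix);

    std::vector<int> prev(y.size() + 1, 0), curr(y.size() + 1, 0);
    for (std::size_t i = 1; i <= x.size(); ++i) {
        for (std::size_t j = 1; j <= y.size(); ++j) {
            if (x[i - 1] == y[j - 1])
                curr[j] = prev[j - 1] + 1;
            else
                curr[j] = std::max(prev[j], curr[j - 1]);
        }
        std::swap(prev, curr);
    }
    return static_cast<int>(t.prefix + t.suffix) + prev[y.size()];
}
```

For "ABCD" and "ABXY" the prefix "AB" is skipped and the DP only sees "CD" and "XY", giving 2 in total. When the two strings are identical, the DP loops never run at all, which is the linear-time best case described above.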
