
How to optimise the O(m.n) solution for longest common subsequence?

Given two strings, X of length x1 and Y of length y1, find the longest sequence of characters that appears left to right (but not necessarily as a contiguous block) in both strings.

E.g. if X = ABCBDAB and Y = BDCABA, then LCS(X,Y) = {"BCBA","BDAB","BCAB"} and the LCS length is 4.

I used the standard solution for this problem:

if (X[i] == Y[j]): LCS(i,j) = 1 + LCS(i+1,j+1)
if (X[i] != Y[j]): LCS(i,j) = max(LCS(i,j+1), LCS(i+1,j))

and then I used memoization, making it a standard DP problem.

    #include<iostream>
    #include<string>
    using namespace std;
    int LCS[1024][1024]; // fixed-size table: strings longer than 1023 characters index out of bounds

     int LCSlen(string &x, int x1, string &y, int y1){

        for(int i = 0; i <= x1; i++)
            LCS[i][y1] = 0;

        for(int j = 0; j <= y1; j++)
             LCS[x1][j] = 0;

        for(int i = x1 - 1; i >= 0; i--){

            for(int j = y1 - 1; j >= 0; j--){

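                // Start from the diagonal neighbour, extended by one on a match.
                LCS[i][j] = LCS[i+1][j+1];

                if(x[i] == y[j])
                    LCS[i][j]++;

                // Otherwise take the better of skipping y[j] or skipping x[i].
                if(LCS[i][j+1] > LCS[i][j])
                    LCS[i][j] = LCS[i][j+1];

                if(LCS[i+1][j] > LCS[i][j])
                    LCS[i][j] = LCS[i+1][j];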

            }
        }

    return LCS[0][0];
    } 

    int main()
    {
        string x;
        string y;
        cin >> x >> y;
        int x1 = x.length() , y1 = y.length();
        int ans = LCSlen( x, x1, y, y1);
        cout << ans << endl;
        return 0;
    }

I submitted this solution on SPOJ and got a time limit exceeded and/or a runtime error.

Only 14 user solutions have been accepted so far. Is there a smarter trick to reduce the time complexity of this problem?

LCS is a classical, well-studied computer science problem. For the case with two sequences, the standard dynamic-programming solution takes O(n·m) time, and no general algorithm with a substantially better worst-case bound is known.

Furthermore, your implementation has no obvious efficiency bugs, so it should run close to as fast as this approach allows. It may, however, be beneficial to use a dynamically sized 2D matrix rather than an oversized one: the fixed table takes up 4 MiB of memory and will cause frequent cache misses, which are costly because each one triggers a transfer from main memory to the processor cache, and main memory is far slower than cached memory access.
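To illustrate the memory point, here is a minimal sketch that sizes the table from the input and keeps only two rows of it. The function name `lcs_length_two_rows` is made up for this example, and it only returns the length, not the subsequence itself. The time complexity is still O(n·m); only the memory footprint shrinks.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the LCS of x and y, using two rows sized from the input
// instead of a fixed 1024 x 1024 table. (Hypothetical helper name.)
int lcs_length_two_rows(const std::string& x, const std::string& y) {
    // Put the shorter string along the row so the rows are as small as possible.
    const std::string& rows = (x.size() >= y.size()) ? x : y;
    const std::string& cols = (x.size() >= y.size()) ? y : x;

    std::vector<int> prev(cols.size() + 1, 0);  // row i-1 of the DP table
    std::vector<int> curr(cols.size() + 1, 0);  // row i of the DP table

    for (std::size_t i = 1; i <= rows.size(); ++i) {
        for (std::size_t j = 1; j <= cols.size(); ++j) {
            if (rows[i - 1] == cols[j - 1])
                curr[j] = prev[j - 1] + 1;                 // extend a match
            else
                curr[j] = std::max(prev[j], curr[j - 1]);  // skip one character
        }
        std::swap(prev, curr);  // the row just filled becomes the previous row
    }
    return prev[cols.size()];
}
```

Calling `lcs_length_two_rows("ABCBDAB", "BDCABA")` gives 4 for the example from the question.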

In terms of algorithm, in order to lower the theoretical bound you need to exploit specifics of your input structure: for instance, if you are searching one of the strings repeatedly, it may pay to build a search index, which takes some processing time but will make the actual search much faster. Two classical variants of that are the suffix array and the suffix tree.

If it is known that at least one of your strings is very short (< 64 characters) you can use Myers' bit vector algorithm, which performs much faster. Unfortunately the algorithm is far from trivial to implement. There exists an implementation in the SeqAn library, but using the library itself has a steep learning curve.
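To give a feel for the bit-vector idea, here is a rough sketch of a bit-parallel LCS-length computation for the case where one string fits in a single 64-bit word. This is not Myers' algorithm itself (which targets edit distance) and not SeqAn's code; it is a related recurrence from the bit-parallel LCS literature, and the function name `lcs_length_bitparallel` is invented for this example. Each character of the long string is processed with a constant number of word operations, so the running time is proportional to the length of the long string.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// LCS length when shortStr has at most 64 characters. (Hypothetical helper name.)
int lcs_length_bitparallel(const std::string& shortStr, const std::string& longStr) {
    const std::size_t m = shortStr.size();
    assert(m <= 64);  // assumption of this sketch: the short string fits in one word

    // match[c] has bit i set exactly when shortStr[i] == c.
    std::uint64_t match[256] = {0};
    for (std::size_t i = 0; i < m; ++i)
        match[static_cast<unsigned char>(shortStr[i])] |= std::uint64_t{1} << i;

    // v encodes one DP row: a zero bit marks a position where the LCS length grows.
    std::uint64_t v = ~std::uint64_t{0};
    for (unsigned char c : longStr) {
        const std::uint64_t u = v & match[c];
        v = (v + u) | (v - u);  // unsigned wrap-around on overflow is intended
    }

    // The LCS length is the number of zero bits among the low m bits of v.
    const std::uint64_t mask =
        (m == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << m) - 1);
    return static_cast<int>(std::bitset<64>(~v & mask).count());
}
```

For longer strings the same recurrence can be run over several words with carry propagation between them, which is where the implementation effort mentioned above comes in.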

(As a matter of interest, this algorithm finds frequent application in bioinformatics, and has been used during the sequence assembly in the Human Genome Project.)

Although I still didn't get an AC because of the time limit, I was able to implement the linear-space algorithm. In case anyone wants to see it, here is a C++ implementation of Hirschberg's algorithm.

#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <cstring>
#include <string>
#include <cstdio>
using namespace std;

int* compute_help_table(const string & A,const string & B);
string lcs(const string & A, const string & B);
string simple_solution(const string & A, const string & B);

int main(void) {
    string A,B;
    cin>>A>>B;

    cout << lcs(A, B).size() << endl;

    return 0;
}

string lcs(const string &A, const string &B) {
    int m = A.size();
    int n = B.size();

    if (m == 0 || n == 0) {
        return "";
    }
    else if(m == 1) {
        return simple_solution(A, B);
    }
    else if(n == 1) {
        return simple_solution(B, A);
    }
    else {
        int i = m / 2;

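        // Reverse both the second half of A and all of B, so that the table
        // computed below gives LCS lengths of A[i..m) against every suffix of B.
        string Asubstr = A.substr(i, m - i);
        reverse(Asubstr.begin(), Asubstr.end());
        string Brev = B;
        reverse(Brev.begin(), Brev.end());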

        int* L1 = compute_help_table(A.substr(0, i), B);
        int* L2 = compute_help_table(Asubstr, Brev);

        int k;
        int M = -1;
        for(int j = 0; j <= n; j++) {
            if(M < L1[j] + L2[n-j]) {
                M = L1[j] + L2[n-j];
                k = j;
            }
        }

        delete [] L1;
        delete [] L2;

        return lcs(A.substr(0, i), B.substr(0, k)) + lcs(A.substr(i, m - i), B.substr(k, n - k));
    }
}

int* compute_help_table(const string &A, const string &B) {
    int m = A.size();
    int n = B.size();

    int* first = new int[n+1];
    int* second = new int[n+1];

    for(int i = 0; i <= n; i++) {
        second[i] = 0;
    }

    for(int i = 0; i < m; i++) {
        for(int k = 0; k <= n; k++) {
            first[k] = second[k];  
        }

        for(int j = 0; j < n; j++) {
            if(j == 0) {
                if (A[i] == B[j])
                    second[1] = 1;
            }
            else {
                if(A[i] == B[j]) {
                    second[j+1] = first[j] + 1;
                }
                else {
                    second[j+1] = max(second[j], first[j+1]);
                }
            }
        }
    }

    delete [] first;
    return second;
}

string simple_solution(const string & A, const string & B) {
    int i = 0;
    for(; i < B.size(); i++) {
        if(B.at(i) == A.at(0))
            return A;
    }

    return "";
}


If the two strings share a common prefix (e.g. "ABCD" and "ABXY" share "AB"), then that prefix will be part of an LCS. The same holds for common suffixes. So for some pairs of strings you can gain some speed by skipping over the longest common prefix and the longest common suffix before starting the DP algorithm. This doesn't change the worst-case bounds, but it improves the best case (nearly identical strings) to linear time and constant space.
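A minimal sketch of that trimming step, assuming only the length is needed: the function name `lcs_length_trimmed` is made up here, and the middle part is handled with a plain two-row DP rather than any particular implementation from this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// LCS length with the common prefix and suffix stripped before the DP.
// (Hypothetical helper name.)
int lcs_length_trimmed(const std::string& x, const std::string& y) {
    std::size_t lo = 0;            // length of the common prefix
    std::size_t xEnd = x.size();   // one past the last untrimmed char of x
    std::size_t yEnd = y.size();   // one past the last untrimmed char of y

    while (lo < xEnd && lo < yEnd && x[lo] == y[lo])
        ++lo;
    while (xEnd > lo && yEnd > lo && x[xEnd - 1] == y[yEnd - 1]) {
        --xEnd;
        --yEnd;
    }
    // Characters matched by the prefix and the suffix together.
    const int matched = static_cast<int>(lo + (x.size() - xEnd));

    // Two-row DP over the remaining middle parts x[lo, xEnd) and y[lo, yEnd).
    const std::size_t n = yEnd - lo;
    std::vector<int> prev(n + 1, 0), curr(n + 1, 0);
    for (std::size_t i = lo; i < xEnd; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (x[i] == y[lo + j])
                curr[j + 1] = prev[j] + 1;
            else
                curr[j + 1] = std::max(prev[j + 1], curr[j]);
        }
        std::swap(prev, curr);
    }
    return matched + prev[n];
}
```

When the strings are identical, both loops over the middle part are skipped and the work is a single linear scan.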
