How to apply longest common subsequence algorithm on large strings?

Question

How can I apply the longest common subsequence algorithm to bigger strings (600000 characters)? Is there any way to do it with DP? I have done this for shorter strings:

#include <iostream>
#include <algorithm>
#include <cstring>
#include <cstdio>
using namespace std;

int dp[1005][1005];
char a[1005], b[1005];

int lcs(int x,int y)
{
    if(x==strlen(a)||y==strlen(b))
        return 0;
    if(dp[x][y]!=-1)
        return dp[x][y];
    else if(a[x]==b[y])
        dp[x][y]=1+lcs(x+1,y+1);
    else
        dp[x][y]=max(lcs(x+1,y),lcs(x,y+1));
    return dp[x][y];
}

int main()
{
    while(gets(a)&&gets(b))
    {
        memset(dp,-1,sizeof(dp));
        int ret=lcs(0,0);
        printf("%d\n",ret);
    }
}

Answer 1

You should take a look at this article, which discusses the various design and implementation considerations. It points out that you can use Hirschberg's algorithm, which finds an optimal alignment between two strings (under edit distance, also called Levenshtein distance) in linear space. That can greatly reduce the amount of memory you need.

At the bottom of the article you will find the "space-efficient LCS", sketched as follows, where m is the length of A and n is the length of B:

#include <algorithm>
#include <cstring>
#include <vector>

int lcs_length(const char *A, const char *B) {
  const int m = static_cast<int>(std::strlen(A));  // length of A
  const int n = static_cast<int>(std::strlen(B));  // length of B

  // One-dimensional arrays: X is the row being filled (row i),
  // Y is the previously computed row (row i + 1).
  std::vector<int> X(n + 1, 0), Y(n + 1, 0);

  for (int i = m; i >= 0; i--) {
    for (int j = n; j >= 0; j--) {
      if (A[i] == '\0' || B[j] == '\0') {
        X[j] = 0;
      }
      else if (A[i] == B[j]) {
        X[j] = 1 + Y[j+1];
      }
      else {
        X[j] = std::max(Y[j], X[j+1]);
      }
    }

    // Make the row just computed the "previous" row for the next
    // iteration. Swapping two vectors exchanges their buffers in
    // constant time, so nothing is copied element by element.
    X.swap(Y);
  }

  // After the final swap, the row for i == 0 is in Y.
  return Y[0];
}

If you are interested here is a paper about LCS on strings from large alphabets. Find algorithm Approx2LCS in section 3.2.
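
The answer only names Hirschberg's algorithm, so here is a rough sketch of how its divide-and-conquer idea can recover an actual longest common subsequence, not just its length, while keeping only two rows of scores at a time. The function names lcsRow and hirschbergLCS are made up for this illustration. The sketch copies substrings for readability, so its memory use is somewhat above the strictly linear bound of an index-based implementation, and the running time is still proportional to the product of the two lengths.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Last row of the LCS-length table: result[j] = LCS length of a and b[0..j).
// (Hypothetical helper name, used only in this sketch.)
std::vector<int> lcsRow(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    for (char ca : a) {
        for (std::size_t j = 0; j < b.size(); ++j)
            cur[j + 1] = (ca == b[j]) ? prev[j] + 1
                                      : std::max(prev[j + 1], cur[j]);
        prev.swap(cur);  // the row just filled becomes the previous row
    }
    return prev;
}

// Returns one longest common subsequence of a and b.
std::string hirschbergLCS(const std::string& a, const std::string& b) {
    if (a.empty() || b.empty()) return std::string();
    if (a.size() == 1) {
        if (b.find(a[0]) != std::string::npos) return a;
        return std::string();
    }

    const std::size_t mid = a.size() / 2;
    const std::string aLeft = a.substr(0, mid), aRight = a.substr(mid);

    std::size_t split = 0;
    {
        // Forward scores for the left half of a, backward scores for the
        // right half (computed on reversed strings). The temporaries are
        // released at the end of this scope, before recursing.
        const std::vector<int> fwd = lcsRow(aLeft, b);
        const std::string aRightRev(aRight.rbegin(), aRight.rend());
        const std::string bRev(b.rbegin(), b.rend());
        const std::vector<int> bwd = lcsRow(aRightRev, bRev);

        // Split b at the position where both halves together score best.
        int best = -1;
        for (std::size_t k = 0; k <= b.size(); ++k) {
            const int score = fwd[k] + bwd[b.size() - k];
            if (score > best) { best = score; split = k; }
        }
    }

    return hirschbergLCS(aLeft, b.substr(0, split)) +
           hirschbergLCS(aRight, b.substr(split));
}
```

If only the length is needed, the two-row routine above is enough; the extra work here pays off when the subsequence itself has to be printed.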

Answer 2

First, use the bottom-up approach of dynamic programming:

// #includes and using namespace std;

const int SIZE = 1000;
int dp[SIZE + 1][SIZE + 1];
char a[SIZE + 1], b[SIZE + 1];

int lcs_bottomUp(){
    int strlenA = strlen(a), strlenB = strlen(b);
    for(int y = 0; y <= strlenB; y++)
        dp[strlenA][y] = 0;
    for(int x = strlenA - 1; x >= 0; x--){
        dp[x][strlenB] = 0;
        for(int y = strlenB - 1; y >= 0; y--)
            dp[x][y] = (a[x]==b[y]) ? 1 + dp[x+1][y+1] :
                    max(dp[x+1][y], dp[x][y+1]);
    }
    return dp[0][0];
}

int main(){
    while(gets(a) && gets(b)){
        printf("%d\n", lcs_bottomUp());
    }
}

Observe that you only need to keep 2 rows (or columns), one for dp[x] and another for dp[x + 1]:

// #includes and using namespace std;

const int SIZE = 1000;
int dp_x[SIZE + 1]; // dp[x]
int dp_xp1[SIZE + 1]; // dp[x + 1]
char a[SIZE + 1], b[SIZE + 1];

int lcs_bottomUp_2row(){
    int strlenA = strlen(a), strlenB = strlen(b);
    for(int y = 0; y <= strlenB; y++)
        dp_x[y] = 0; // assume x == strlenA
    for(int x = strlenA - 1; x >= 0; x--){
        // x has been decreased
        memcpy(dp_xp1, dp_x, sizeof(dp_x)); // dp[x + 1] <- dp[x]

        dp_x[strlenB] = 0;
        for(int y = strlenB - 1; y >= 0 ; y--)
            dp_x[y] = (a[x]==b[y]) ? 1 + dp_xp1[y+1] :
                    max(dp_xp1[y], dp_x[y+1]);
    }
    return dp_x[0]; // assume x == 0
}

int main(){
    while(gets(a) && gets(b)){
        printf("%d\n", lcs_bottomUp_2row());
    }
}

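Now it's safe, as far as memory goes, to change SIZE to 600000: the two int rows take about 4.8 MB (assuming a 4-byte int), whereas a full 600001 x 600001 table would need roughly 1.44 TB. The running time is still proportional to strlenA * strlenB.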

Answer 3

As the OP stated, the other answers take too much time, mainly because every outer iteration copies an entire row of up to 600000 entries. Instead of physically copying the column, you can switch the two columns logically by swapping pointers:

#include <algorithm>
#include <string>
#include <vector>

int spaceEfficientLCS(const std::string& a, const std::string& b){
  int i, j, n = a.size(), m = b.size();

  // Two columns of m + 1 entries, indexed by the position j in b.
  // std::vector is standard C++ (variable-length arrays are not), keeps
  // the data off the stack and zero-initialises it.
  std::vector<int> costs1(m + 1, 0), costs2(m + 1, 0);

  // Choose columns in a way that the return value will be costs1[0]:
  // the loop below runs n + 1 times and switches after every iteration
  int* mainCol, *secCol;
  if (n%2){
    mainCol = costs2.data();
    secCol = costs1.data();
  }
  else{
    mainCol = costs1.data();
    secCol = costs2.data();
  }

  // Compute costs
  for (i = n; i >= 0; i--){
    for (j = m; j >= 0; j--){
      // Test the indices rather than '\0' so embedded NULs are handled
      if (i == n || j == m) mainCol[j] = 0;
      else mainCol[j] = (a[i] == b[j]) ? secCol[j+1] + 1 :
                        std::max(secCol[j], mainCol[j+1]);
    }

    // Switch logic column
    int* aux = mainCol;
    mainCol = secCol;
    secCol = aux;
  }

  return costs1[0];
}
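
The columns are indexed only by positions in b, so they need b.size() + 1 entries rather than the length of the longer string, and since the LCS length is symmetric in its arguments you can let the shorter string play that role. Here is a minimal sketch of that variant, which swaps std::vector buffers instead of raw pointers. The name lcsLengthMinSpace is made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

int lcsLengthMinSpace(const std::string& s, const std::string& t) {
    // `inner` is the shorter string; it alone determines the column length.
    const std::string& outer = (s.size() >= t.size()) ? s : t;
    const std::string& inner = (s.size() >= t.size()) ? t : s;
    const std::size_t n = outer.size(), m = inner.size();

    std::vector<int> mainCol(m + 1, 0), secCol(m + 1, 0);
    for (std::size_t i = n; i-- > 0; ) {
        mainCol[m] = 0;
        for (std::size_t j = m; j-- > 0; )
            mainCol[j] = (outer[i] == inner[j])
                             ? secCol[j + 1] + 1
                             : std::max(secCol[j], mainCol[j + 1]);
        std::swap(mainCol, secCol);  // constant-time buffer exchange
    }
    // After the final swap the last computed column is in secCol.
    return secCol[0];
}
```

With one 600000-character string and one short string, this keeps the working memory proportional to the short one; the number of cell updates is unchanged.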


solution1
3 2013-09-07 14:00:34

solution2
2 2013-09-07 13:59:55

solution3
1 2014-08-08 03:34:23
