
How to apply longest common subsequence algorithm on large strings?

How can I apply the longest common subsequence algorithm to larger strings (600000 characters)? Is there any way to do it with DP? I have done this for shorter strings:

#include <iostream>
#include <algorithm>
#include <cstring>
#include <cstdio>
using namespace std;

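int dp[1005][1005];
char a[1005], b[1005];
int lenA, lenB; // cached lengths, so lcs() does not call strlen on every invocation

int lcs(int x,int y)
{
    if(x==lenA||y==lenB)
        return 0;
    if(dp[x][y]!=-1)
        return dp[x][y];
    else if(a[x]==b[y])
        dp[x][y]=1+lcs(x+1,y+1);
    else
        dp[x][y]=max(lcs(x+1,y),lcs(x,y+1));
    return dp[x][y];
}

// Reads one line into buf and drops the trailing newline.
// Stands in for gets(), which was removed in C11 and C++14.
bool readLine(char *buf,int size)
{
    if(!fgets(buf,size,stdin))
        return false;
    buf[strcspn(buf,"\n")]='\0';
    return true;
}

int main()
{
    while(readLine(a,sizeof(a))&&readLine(b,sizeof(b)))
    {
        lenA=strlen(a);
        lenB=strlen(b);
        memset(dp,-1,sizeof(dp));
        int ret=lcs(0,0);
        printf("%d\n",ret);
    }
}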

You should take a look at this article, which discusses the various design and implementation considerations. It points out that you can look at Hirschberg's algorithm, which finds optimal alignments between two strings under edit distance (Levenshtein distance). Because it needs only linear space, it can greatly reduce the amount of memory required.

At the bottom you will find the "space-efficient LCS", defined as follows, where m is the length of A and n is the length of B:

#include <algorithm>
#include <cstring>
#include <vector>

int lcs_length(const char *A, const char *B) {
  const int m = static_cast<int>(std::strlen(A));
  const int n = static_cast<int>(std::strlen(B));

  // Allocate storage for one-dimensional arrays X and Y.
  // X is the row being computed for A[i..]; Y is the finished row for A[i+1..].
  std::vector<int> X(n + 1, 0), Y(n + 1, 0);

  for (int i = m; i >= 0; i--) {
    for (int j = n; j >= 0; j--) {
      if (A[i] == '\0' || B[j] == '\0') {
        X[j] = 0;
      }
      else if (A[i] == B[j]) {
        X[j] = 1 + Y[j+1];
      }
      else {
        X[j] = std::max(Y[j], X[j+1]);
      }
    }

    // Copy contents of X into Y. With std::vector, "=" copies the
    // elements. If X and Y were raw pointers, "=" would only assign
    // the address and not copy the contents, so in that case you'd
    // do a memcpy instead.
    Y = X;
  }

  return X[0];
}
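The routine above returns only the length. As a rough illustration of the Hirschberg idea mentioned earlier, here is a minimal sketch that recovers one actual LCS by divide and conquer. The names `lcsRow` and `hirschbergLCS` are made up for this example, and the sketch copies substrings for readability, so its memory use is not as tight as an index-based version would be.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Last row of the LCS-length table for a vs. b, using O(|b|) space.
// result[j] = LCS length of a and the first j characters of b.
static std::vector<int> lcsRow(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    for (char ca : a) {
        cur[0] = 0;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            cur[j] = (ca == b[j - 1]) ? prev[j - 1] + 1
                                      : std::max(prev[j], cur[j - 1]);
        }
        std::swap(prev, cur);
    }
    return prev;
}

// Returns one longest common subsequence of a and b (hypothetical helper).
std::string hirschbergLCS(const std::string& a, const std::string& b) {
    if (a.empty() || b.empty()) return std::string();
    if (a.size() == 1) {
        if (b.find(a[0]) != std::string::npos) return a;
        return std::string();
    }

    const std::size_t mid = a.size() / 2;
    const std::string aLeft = a.substr(0, mid), aRight = a.substr(mid);

    std::size_t split = 0;
    {
        // Forward scores for the left half of a, backward scores for the right half.
        const std::vector<int> left = lcsRow(aLeft, b);
        const std::string aRightRev(aRight.rbegin(), aRight.rend());
        const std::string bRev(b.rbegin(), b.rend());
        const std::vector<int> right = lcsRow(aRightRev, bRev);

        // Pick the split point of b that maximizes the combined score.
        int best = -1;
        for (std::size_t j = 0; j <= b.size(); ++j) {
            const int score = left[j] + right[b.size() - j];
            if (score > best) { best = score; split = j; }
        }
    }  // temporary rows are released before recursing

    return hirschbergLCS(aLeft, b.substr(0, split)) +
           hirschbergLCS(aRight, b.substr(split));
}
```

Calling `hirschbergLCS("AGGTAB", "GXTXAYB")` yields `"GTAB"`. The running time is still proportional to the product of the two lengths, so for 600000-character inputs this addresses memory, not speed.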

If you are interested, here is a paper about LCS on strings from large alphabets. See algorithm Approx2LCS in section 3.2.

First, use the bottom-up approach to dynamic programming:

// #includes and using namespace std;

const int SIZE = 1000;
int dp[SIZE + 1][SIZE + 1];
char a[SIZE + 1], b[SIZE + 1];

int lcs_bottomUp(){
    int strlenA = strlen(a), strlenB = strlen(b);
    for(int y = 0; y <= strlenB; y++)
        dp[strlenA][y] = 0;
    for(int x = strlenA - 1; x >= 0; x--){
        dp[x][strlenB] = 0;
        for(int y = strlenB - 1; y >= 0; y--)
            dp[x][y] = (a[x]==b[y]) ? 1 + dp[x+1][y+1] :
                    max(dp[x+1][y], dp[x][y+1]);
    }
    return dp[0][0];
}

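// Reads one line into buf and drops the trailing newline.
// Stands in for gets(), which was removed in C11 and C++14.
bool readLine(char *buf, int size){
    if(!fgets(buf, size, stdin)) return false;
    buf[strcspn(buf, "\n")] = '\0';
    return true;
}

int main(){
    while(readLine(a, sizeof(a)) && readLine(b, sizeof(b))){
        printf("%d\n", lcs_bottomUp());
    }
}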
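For reference, the table filled in by the loops above corresponds to the following recurrence. The notation is an assumption made for this note: L(x, y) stands for dp[x][y], n for strlenA, and m for strlenB.

```latex
L(x, y) =
\begin{cases}
0 & \text{if } x = n \text{ or } y = m,\\
1 + L(x+1,\, y+1) & \text{if } a_x = b_y,\\
\max\bigl(L(x+1,\, y),\; L(x,\, y+1)\bigr) & \text{otherwise.}
\end{cases}
```

The length of the LCS is L(0, 0).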

Observe that you only need to keep 2 rows (or columns), one for dp[x] and another for dp[x + 1]:

// #includes and using namespace std;

const int SIZE = 1000;
int dp_x[SIZE + 1]; // dp[x]
int dp_xp1[SIZE + 1]; // dp[x + 1]
char a[SIZE + 1], b[SIZE + 1];

int lcs_bottomUp_2row(){
    int strlenA = strlen(a), strlenB = strlen(b);
    for(int y = 0; y <= strlenB; y++)
        dp_x[y] = 0; // assume x == strlenA
    for(int x = strlenA - 1; x >= 0; x--){
        // x has been decreased
        memcpy(dp_xp1, dp_x, sizeof(dp_x)); // dp[x + 1] <- dp[x]

        dp_x[strlenB] = 0;
        for(int y = strlenB - 1; y >= 0 ; y--)
            dp_x[y] = (a[x]==b[y]) ? 1 + dp_xp1[y+1] :
                    max(dp_xp1[y], dp_x[y+1]);
    }
    return dp_x[0]; // assume x == 0
}

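// Reads one line into buf and drops the trailing newline.
// Stands in for gets(), which was removed in C11 and C++14.
bool readLine(char *buf, int size){
    if(!fgets(buf, size, stdin)) return false;
    buf[strcspn(buf, "\n")] = '\0';
    return true;
}

int main(){
    while(readLine(a, sizeof(a)) && readLine(b, sizeof(b))){
        printf("%d\n", lcs_bottomUp_2row());
    }
}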

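Now it's safe to change SIZE to 600000, at least as far as memory is concerned: the two rows take roughly 4.8 MB with 4-byte ints, instead of a full 600000 × 600000 table. The running time is still proportional to the product of the two string lengths.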
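The two rows can be squeezed further into a single row plus one saved value for the diagonal entry, which also removes the per-row copy. The following is a minimal sketch of that variant, not part of the original answer; the function name `lcsLengthOneRow` is hypothetical, and it takes `std::string` arguments instead of the global buffers used above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// LCS length using a single row of min(|a|, |b|) + 1 ints.
int lcsLengthOneRow(const std::string& a, const std::string& b) {
    // Keep the shorter string on the inner dimension so the row stays small.
    const std::string& outer = (a.size() >= b.size()) ? a : b;
    const std::string& inner = (a.size() >= b.size()) ? b : a;

    // Before processing index i, row[j] holds the LCS length of
    // outer[i+1..] and inner[j..]; afterwards it holds it for outer[i..].
    std::vector<int> row(inner.size() + 1, 0);

    for (std::size_t i = outer.size(); i-- > 0;) {
        int diag = 0;  // old value of row[j + 1], i.e. dp[i+1][j+1]
        for (std::size_t j = inner.size(); j-- > 0;) {
            const int old = row[j];  // dp[i+1][j], needed as diag for j - 1
            row[j] = (outer[i] == inner[j]) ? diag + 1
                                            : std::max(row[j], row[j + 1]);
            diag = old;
        }
    }
    return row[0];
}
```

For example, `lcsLengthOneRow("AGGTAB", "GXTXAYB")` returns 4. The number of cell updates is unchanged, so this only trims memory and copying overhead.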

As the OP stated, the other answers take too much time, mainly because 600000 entries are copied on every outer iteration. To improve on this, instead of physically copying one column into the other, you can swap the two columns logically. Thus:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

int spaceEfficientLCS(const std::string& a, const std::string& b){
  int i, j, n = a.size(), m = b.size();

  // Size of columns is based on the size of the biggest string
  int maxLength = (n < m) ? m : n;

  // Zero-initialized and heap-allocated: variable-length arrays are not
  // standard C++, and arrays this large could overflow the stack
  std::vector<int> costs1(maxLength+1, 0), costs2(maxLength+1, 0);

  // Choose columns in a way that the return value will be costs1[0]
  int* mainCol, *secCol;
  if (n%2){
    mainCol = costs2.data();
    secCol = costs1.data();
  }
  else{
    mainCol = costs1.data();
    secCol = costs2.data();
  }

  // Compute costs
  for (i = n; i >= 0; i--){
    for (j = m; j >= 0; j--){
      // Past the end of either string the LCS is empty
      if (i == n || j == m) mainCol[j] = 0;
      else mainCol[j] = (a[i] == b[j]) ? secCol[j+1] + 1 :
                        std::max(secCol[j], mainCol[j+1]);
    }

    // Switch logic column
    std::swap(mainCol, secCol);
  }

  return costs1[0];
}
