简体   繁体   English

如何使用C ++查找最长的公共子字符串

[英]How to find Longest Common Substring using C++

I searched online for a C++ Longest Common Substring implementation but failed to find a decent one. 我在网上搜索了C ++ Longest Common Substring实现,但是找不到合适的实现。 I need a LCS algorithm that returns the substring itself, so it's not just LCS. 我需要一个返回子字符串本身的LCS算法,所以不仅仅是LCS。

I was wondering, though, about how I can do this between multiple strings. 但是,我想知道如何在多个字符串之间执行此操作。

My idea was to check the longest one between 2 strings, and then go check all the others, but this is a very slow process which requires managing many long strings on the memory, making my program quite slow. 我的想法是检查2个字符串中最长的一个,然后再检查所有其他字符串,但这是一个非常缓慢的过程,需要在内存上管理许多长字符串,这使我的程序非常慢。

Any idea of how this can be speeded up for multiple strings? 是否知道如何加快多个字符串的速度? Thank you. 谢谢。

Important Edit One of the variables I'm given determines the number of strings the longest common substring needs to be in, so I can be given 10 strings, and find the LCS of them all (K=10), or LCS of 4 of them, but I'm not told which 4, I have to find the best 4. 重要说明我得到的变量之一确定了最长的公共子字符串需要放入的字符串数,因此可以给我10个字符串,并找出所有的LCS(K = 10),或者找到4个LCS。他们,但我没有被告知哪个4,我必须找到最好的4。

The answer is GENERALISED SUFFIX TREE. 答案是广义后缀树。 http://en.wikipedia.org/wiki/Generalised_suffix_tree http://en.wikipedia.org/wiki/Generalised_suffix_tree

You can build a generalised suffix tree with multiple string. 您可以使用多个字符串构建通用后缀树。

Look at this http://en.wikipedia.org/wiki/Longest_common_substring_problem 看看这个http://en.wikipedia.org/wiki/Longest_common_substring_problem

The Suffix tree can be built in O(n) time for each string, k*O(n) in total. 可以为每个字符串在O(n)时间内构建后缀树,总计为k * O(n)。 K is total number of strings. K是字符串总数。

So it's very quick to solve this problem. 因此,解决这个问题很快。

这是一篇很好的文章,它有效地找到了所有常见的子字符串 ,并带有C语言中的示例。如果只需要最长的时间,这可能会过高,但是比后缀树的一般文章更容易理解。

This is a dynamic programming problem and can be solved in O(mn) time, where m is the length of one string and n is of other. 这是一个动态编程问题,可以在O(mn)时间内解决,其中m是一个字符串的长度,而n是另一个字符串的长度。

Like any other problem solved using Dynamic Programming, we will divide the problem into subproblem. 像使用动态编程解决的任何其他问题一样,我们会将问题分为子问题。 Lets say if two strings are x1x2x3....xm and y1y2y3...yn 假设两个字符串是x1x2x3 .... xm和y1y2y3 ... yn

S(i,j) is the longest common string for x1x2x3...xi and y1y2y3....yj, then S(i,j)是x1x2x3 ... xi和y1y2y3 .... yj的最长公共字符串

S(i,j) = max { length of longest common substring ending at xi/yj, if ( x[i] == y[j] ), S(i-1, j-1), S(i, j-1), S(i-1, j) } S(i,j)= max {如果(x [i] == y [j]),S(i-1,j-1),S(i,j -1),S(i-1,j)}

Here is working program in Java. 这是Java中的工作程序。 I am sure you can convert it to C++.: 我相信您可以将其转换为C ++。

public class LongestCommonSubstring {

    public static void main(String[] args) {
        String str1 = "abcdefgijkl";
        String str2 = "mnopabgijkw";
        System.out.println(getLongestCommonSubstring(str1,str2));
    }

    public static String getLongestCommonSubstring(String str1, String str2) {
        //Note this longest[][] is a standard auxialry memory space used in Dynamic
                //programming approach to save results of subproblems. 
                //These results are then used to calculate the results for bigger problems
        int[][] longest = new int[str2.length() + 1][str1.length() + 1];
        int min_index = 0, max_index = 0;

                //When one string is of zero length, then longest common substring length is 0
        for(int idx = 0; idx < str1.length() + 1; idx++) {
            longest[0][idx] = 0;
        }

        for(int idx = 0; idx < str2.length() + 1; idx++) {
            longest[idx][0] = 0;
        }

        for(int i = 0; i <  str2.length(); i++) {
            for(int j = 0; j < str1.length(); j++) {

                int tmp_min = j, tmp_max = j, tmp_offset = 0;

                if(str2.charAt(i) == str1.charAt(j)) {
                    //Find length of longest common substring ending at i/j
                    while(tmp_offset <= i && tmp_offset <= j &&
                            str2.charAt(i - tmp_offset) == str1.charAt(j - tmp_offset)) {

                        tmp_min--;
                        tmp_offset++;

                    }
                }
                //tmp_min will at this moment contain either < i,j value or the index that does not match
                //So increment it to the index that matches.
                tmp_min++;

                //Length of longest common substring ending at i/j
                int length = tmp_max - tmp_min + 1;
                //Find the longest between S(i-1,j), S(i-1,j-1), S(i, j-1)
                int tmp_max_length = Math.max(longest[i][j], Math.max(longest[i+1][j], longest[i][j+1]));

                if(length > tmp_max_length) {
                    min_index = tmp_min;
                    max_index = tmp_max;
                    longest[i+1][j+1] = length;
                } else {
                    longest[i+1][j+1] = tmp_max_length;
                }


            }
        }

        return str1.substring(min_index, max_index >= str1.length() - 1 ? str1.length() - 1 : max_index + 1);
    }
}

There is a very elegant Dynamic Programming solution to this. 有一个非常优雅的动态编程解决方案。

Let LCSuff[i][j] be the longest common suffix between X[1..m] and Y[1..n] . LCSuff[i][j]X[1..m]Y[1..n]之间的最长公共后缀。 We have two cases here: 这里有两种情况:

  • X[i] == Y[j] , that means we can extend the longest common suffix between X[i-1] and Y[j-1] . X[i] == Y[j] ,这意味着我们可以扩展X[i-1]Y[j-1]之间的最长公共后缀。 Thus LCSuff[i][j] = LCSuff[i-1][j-1] + 1 in this case. 因此,在这种情况下, LCSuff[i][j] = LCSuff[i-1][j-1] + 1

  • X[i] != Y[j] , since the last characters themselves are different, X[1..i] and Y[1..j] can't have a common suffix. X[i] != Y[j] ,因为最后一个字符本身不同,所以X[1..i]Y[1..j]不能有共同的后缀。 Hence, LCSuff[i][j] = 0 in this case. 因此,在这种情况下, LCSuff[i][j] = 0

We now need to check maximal of these longest common suffixes. 现在,我们需要检查这些最长的通用后缀的最大值。

So, LCSubstr(X,Y) = max(LCSuff(i,j)) , where 1<=i<=m and 1<=j<=n 因此, LCSubstr(X,Y) = max(LCSuff(i,j)) ,其中1<=i<=m1<=j<=n

The algorithm pretty much writes itself now. 该算法现在几乎可以自行编写。

string LCSubstr(string x, string y){
    int m = x.length(), n=y.length();

    int LCSuff[m][n];

    for(int j=0; j<=n; j++)
        LCSuff[0][j] = 0;
    for(int i=0; i<=m; i++)
        LCSuff[i][0] = 0;

    for(int i=1; i<=m; i++){
        for(int j=1; j<=n; j++){
            if(x[i-1] == y[j-1])
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1;
            else
                LCSuff[i][j] = 0;
        }
    }

    string longest = "";
    for(int i=1; i<=m; i++){
        for(int j=1; j<=n; j++){
            if(LCSuff[i][j] > longest.length())
                longest = x.substr((i-LCSuff[i][j]+1) -1, LCSuff[i][j]);
        }
    }
    return longest;
}

Here is a C# version to find the Longest Common Substring using dynamic programming of two arrays (you may refer to: http://codingworkout.blogspot.com/2014/07/longest-common-substring.html for more details) 这是使用两个数组的动态编程来查找最长公共子字符串的C#版本(有关更多详细信息,您可以参考: http : //codingworkout.blogspot.com/2014/07/longest-common-substring.html

class LCSubstring
        {
            public int Length = 0;
            public List<Tuple<int, int>> indices = new List<Tuple<int, int>>();
        }
        public string[] LongestCommonSubStrings(string A, string B)
        {
            int[][] DP_LCSuffix_Cache = new int[A.Length+1][];
            for (int i = 0; i <= A.Length; i++)
            {
                DP_LCSuffix_Cache[i] = new int[B.Length + 1];
            }
            LCSubstring lcsSubstring = new LCSubstring();
            for (int i = 1; i <= A.Length; i++)
            {
                for (int j = 1; j <= B.Length; j++)
                {
                    //LCSuffix(Xi, Yj) = 0 if X[i] != X[j]
                    //                 = LCSuffix(Xi-1, Yj-1) + 1 if Xi = Yj
                    if (A[i - 1] == B[j - 1])
                    {
                        int lcSuffix = 1 + DP_LCSuffix_Cache[i - 1][j - 1];
                        DP_LCSuffix_Cache[i][j] = lcSuffix;
                        if (lcSuffix > lcsSubstring.Length)
                        {
                            lcsSubstring.Length = lcSuffix;
                            lcsSubstring.indices.Clear();
                            var t = new Tuple<int, int>(i, j);
                            lcsSubstring.indices.Add(t);
                        }
                        else if(lcSuffix == lcsSubstring.Length)
                        {
                            //may be more than one longest common substring
                            lcsSubstring.indices.Add(new Tuple<int, int>(i, j));
                        }
                    }
                    else
                    {
                        DP_LCSuffix_Cache[i][j] = 0;
                    }
                }
            }
            if(lcsSubstring.Length > 0)
            {
                List<string> substrings = new List<string>();
                foreach(Tuple<int, int> indices in lcsSubstring.indices)
                {
                    string s = string.Empty;
                    int i = indices.Item1 - lcsSubstring.Length;
                    int j = indices.Item2 - lcsSubstring.Length;
                    Assert.IsTrue(DP_LCSuffix_Cache[i][j] == 0);
                    for(int l =0; l<lcsSubstring.Length;l++)
                    {
                        s += A[i];
                        Assert.IsTrue(A[i] == B[j]);
                        i++;
                        j++;
                    }
                    Assert.IsTrue(i == indices.Item1);
                    Assert.IsTrue(j == indices.Item2);
                    Assert.IsTrue(DP_LCSuffix_Cache[i][j] == lcsSubstring.Length);
                    substrings.Add(s);
                }
                return substrings.ToArray();
            }
            return new string[0];
        }

Where unit tests are: 单元测试在哪里:

[TestMethod]
        public void LCSubstringTests()
        {
            string A = "ABABC", B = "BABCA";
            string[] substrings = this.LongestCommonSubStrings(A, B);
            Assert.IsTrue(substrings.Length == 1);
            Assert.IsTrue(substrings[0] == "BABC");
            A = "ABCXYZ"; B = "XYZABC";
            substrings = this.LongestCommonSubStrings(A, B);
            Assert.IsTrue(substrings.Length == 2);
            Assert.IsTrue(substrings.Any(s => s == "ABC"));
            Assert.IsTrue(substrings.Any(s => s == "XYZ"));
            A = "ABC"; B = "UVWXYZ";
            string substring = "";
            for(int i =1;i<=10;i++)
            {
                A += i;
                B += i;
                substring += i;
                substrings = this.LongestCommonSubStrings(A, B);
                Assert.IsTrue(substrings.Length == 1);
                Assert.IsTrue(substrings[0] == substring);
            }
        }

I tried several different solutions for this but they all seemed really slow so I came up with the below, didn't really test much, but it seems to work a bit faster for me. 我为此尝试了几种不同的解决方案,但它们似乎都非常慢,因此我提出了以下建议,并没有进行太多测试,但对我来说似乎工作更快。

#include <iostream>

std::string lcs( std::string a, std::string b )
{
    if( a.empty() || b.empty() ) return {} ;

    std::string current_lcs = "";

    for(int i=0; i< a.length(); i++) {
        size_t fpos = b.find(a[i], 0);
        while(fpos != std::string::npos) {
            std::string tmp_lcs = "";
            tmp_lcs += a[i];
            for (int x = fpos+1; x < b.length(); x++) {
                tmp_lcs+=b[x];
                size_t spos = a.find(tmp_lcs, 0);
                if (spos == std::string::npos) {
                    break;
                } else {
                    if (tmp_lcs.length() > current_lcs.length()) {
                        current_lcs = tmp_lcs;
                    }
                }
            }
            fpos = b.find(a[i], fpos+1);
        }
    }
    return current_lcs;
}

int main(int argc, char** argv)
{
    std::cout << lcs(std::string(argv[1]), std::string(argv[2])) << std::endl;
}

Find the largest substring from all strings under consideration. 从正在考虑的所有字符串中找到最大的子字符串。 From N strings, you'll have N substrings. 在N个字符串中,您将有N个子字符串。 Choose the largest of those N. 选择这些N中最大的一个。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM