简体   繁体   中英

Optimisation ideas - Longest common substring

I have this program which is supposed to find the Longest Common Substring of a number of strings. Which it does, but if the strings are very long (ie >8000 characters long), it works slowly (1.5 seconds). Is there any way to optimise that?

The program is this:

//#include "stdafx.h"
#include <iostream>
#include <string>
#include <vector>
#include <cassert>


using namespace std;

const unsigned short MAX_STRINGS = 10;
const unsigned int  MAX_SIZE=10000;
vector<string> strings;
unsigned int len;

string GetLongestCommonSubstring( string string1, string string2 );
inline void readNumberSubstrings();
inline const string getMaxSubstring();

void readNumberSubstrings()
{
    cin >> len;

    assert(len > 1 && len <=MAX_STRINGS);

    strings.resize(len);

    for(register unsigned int i=0; i<len;i++)
        strings[i]=string(MAX_SIZE,0);

    for(register unsigned int i=0; i<len; i++)
        cin>>strings[i];
}

 const string getMaxSubstring()
{
    string maxSubstring=strings[0];
    for(register unsigned int i=1; i < len; i++)
        maxSubstring=GetLongestCommonSubstring(maxSubstring, strings[i]);
    return maxSubstring;
}

string GetLongestCommonSubstring( string string1, string string2 ) 
{

    const int solution_size = string2.length()+ 1;

    int *x=new int[solution_size]();
    int *y= new int[solution_size]();

    int **previous = &x;
    int **current = &y;

    int max_length = 0;
    int result_index = 0;

    int j;
    int length;
    int M=string2.length() - 1;

    for(register int i = string1.length() - 1; i >= 0; i--)
    {
        for(register int j = M; j >= 0; j--) 
        {
            if(string1[i] != string2[j]) 
                (*current)[j] = 0;
            else 
            {
                length = 1 + (*previous)[j + 1];
                if (length > max_length)
                {
                    max_length = length;
                    result_index = i;
                }

                (*current)[j] = length;
            }
        }

        swap(previous, current);
    }
    string1[max_length+result_index]='\0';
    return &(string1[result_index]);
}

int main()
{
    readNumberSubstrings();
    cout << getMaxSubstring() << endl;
    return 0;
}

Note : there is a reason why I didn't write code that would solve this problem with suffix trees (they're large).

Often when it comes to optimization, a different approach might be your only true option rather than trying to incrementally improve the current implementation. Here's my idea:

  • create a list of valid characters that might appear in the longest common substring. Ie, if a character doesn't appear in all strings, it can't be part of the longest common substring.

  • separate each string into multiple strings containing only valid characters

  • for every such string, create every possible substring and add it to the list as well

  • filter (as with the characters) all strings, that don't show up in all lists.

The complexity of this obviously depends largely on the number of invalid characters. if it's zero, this approach doesn't help at all.

Some remarks on your code: Don't try to be overly clever. The compiler will optimize so much, there's really no need for you to put register in your code. Second, your allocating strings and then overwrite them (in readNumberSubstrings ), that's totally unnecessary. Third, pass by const reference if you can. Fourth, don't use raw pointers, especially if you never delete [] your new [] d objects. Use std::vector s instead, it behaves well with exceptions (which you might encounter, you're using strings a lot!).

You have to use suffix tree. This struct will make algorithm, which work about 1 second for 10 string with 10000 symbols.

Give a Suffix Arraya try, they take as much memory as your input strings (depending on your text encoding though) and a built quickly in linear time.

http://en.wikipedia.org/wiki/Suffix_array

Here is my JavaScript code for this

function LCS(as, bs, A, B) {
    var a = 0, b = 0, R = [], max = 1
    while (a < A.length && b < B.length) {
        var M = cmpAt(as, bs, A[a], B[b])
        if (M.size > 0) {
            if (M.ab < 0) {
                var x = b; while (x < B.length) {
                    var C = cmpAt(as, bs, A[a], B[x])
                    if (C.size >= M.size) {  if (C.size >= max) max = C.size, R.push([a, x, C.size]) } else break
                    x++
                }
            } else {
                var x = a; while (x < A.length) {
                    var C = cmpAt(as, bs, A[x], B[b])
                    if (C.size >= M.size) { if (C.size >= max) max = C.size, R.push([x, b, C.size]) } else break
                    x++
                }
            }
        }
        if (M.ab < 0) a++; else b++
    }
    R = R.filter(function(a){ if (a[2] == max) return true })
    return R
}

function cmpAt(a, b, x, y) {
    var c = 0
    while (true) {
        if (x == a.length) {
            if (y == b.length) return { size: c, ab: 0 }
            return { size: c, ab: -1 }
        }
        if (y == b.length) return { size: c, ab: 1 }
        if (a.charCodeAt(x) != b.charCodeAt(y)) {
            var ab = 1; 
            if (a.charCodeAt(x) < b.charCodeAt(y)) ab = -1
            return { size: c, ab: ab }
        }
        c++, x++, y++
    }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM