
Optimisation ideas - Longest common substring

I have this program, which is supposed to find the longest common substring of a number of strings. It does, but if the strings are very long (i.e. more than 8000 characters), it runs slowly (1.5 seconds). Is there any way to optimise that?

The program is this:

//#include "stdafx.h"
#include <iostream>
#include <string>
#include <vector>
#include <cassert>


using namespace std;

const unsigned short MAX_STRINGS = 10;
const unsigned int  MAX_SIZE=10000;
vector<string> strings;
unsigned int len;

string GetLongestCommonSubstring( string string1, string string2 );
inline void readNumberSubstrings();
inline const string getMaxSubstring();

void readNumberSubstrings()
{
    cin >> len;

    assert(len > 1 && len <=MAX_STRINGS);

    strings.resize(len);

    for(register unsigned int i=0; i<len;i++)
        strings[i]=string(MAX_SIZE,0);

    for(register unsigned int i=0; i<len; i++)
        cin>>strings[i];
}

 const string getMaxSubstring()
{
    string maxSubstring=strings[0];
    for(register unsigned int i=1; i < len; i++)
        maxSubstring=GetLongestCommonSubstring(maxSubstring, strings[i]);
    return maxSubstring;
}

string GetLongestCommonSubstring( string string1, string string2 ) 
{

    const int solution_size = string2.length()+ 1;

    int *x=new int[solution_size]();
    int *y= new int[solution_size]();

    int **previous = &x;
    int **current = &y;

    int max_length = 0;
    int result_index = 0;

    int j;
    int length;
    int M=string2.length() - 1;

    for(register int i = string1.length() - 1; i >= 0; i--)
    {
        for(register int j = M; j >= 0; j--) 
        {
            if(string1[i] != string2[j]) 
                (*current)[j] = 0;
            else 
            {
                length = 1 + (*previous)[j + 1];
                if (length > max_length)
                {
                    max_length = length;
                    result_index = i;
                }

                (*current)[j] = length;
            }
        }

        swap(previous, current);
    }
    string1[max_length+result_index]='\0';
    return &(string1[result_index]);
}

int main()
{
    readNumberSubstrings();
    cout << getMaxSubstring() << endl;
    return 0;
}

Note: there is a reason why I didn't write code that solves this problem with suffix trees (they're large).

Often when it comes to optimization, a different approach might be your only true option, rather than trying to incrementally improve the current implementation. Here's my idea:

  • Create a list of valid characters that might appear in the longest common substring. That is, if a character doesn't appear in all strings, it can't be part of the longest common substring.

  • Separate each string into multiple strings containing only valid characters.

  • For every such string, create every possible substring and add it to the list for that input string as well.

  • Filter out (as with the characters) all strings that don't show up in all lists.

The complexity of this obviously depends largely on the number of invalid characters. If it's zero, this approach doesn't help at all.
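The first two steps of this idea can be sketched as follows. This is only a minimal illustration, assuming byte-sized characters; the helper names commonCharacters and splitOnInvalid are made up for this example and do not come from the question.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: valid[c] is true if byte c occurs in every input string.
std::array<bool, 256> commonCharacters(const std::vector<std::string>& inputs) {
    std::array<bool, 256> valid;
    valid.fill(!inputs.empty());
    for (const std::string& s : inputs) {
        std::array<bool, 256> seen{};  // all false
        for (unsigned char c : s) seen[c] = true;
        for (std::size_t c = 0; c < 256; ++c) valid[c] = valid[c] && seen[c];
    }
    return valid;
}

// Hypothetical helper: splits s into maximal runs made up of valid characters only.
std::vector<std::string> splitOnInvalid(const std::string& s,
                                        const std::array<bool, 256>& valid) {
    std::vector<std::string> segments;
    std::string run;
    for (unsigned char c : s) {
        if (valid[c]) {
            run.push_back(static_cast<char>(c));
        } else if (!run.empty()) {
            segments.push_back(run);
            run.clear();
        }
    }
    if (!run.empty()) segments.push_back(run);
    return segments;
}
```

The longest common substring can only lie inside one of the returned segments, so any later search has to look at the segments only, not at the full strings.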

Some remarks on your code: Don't try to be overly clever. The compiler will optimize so much that there's really no need for you to put register in your code. Second, you're allocating strings and then overwriting them (in readNumberSubstrings), which is totally unnecessary. Third, pass by const reference if you can. Fourth, don't use raw pointers, especially if you never delete[] your new[]'d objects. Use std::vector instead; it behaves well with exceptions (which you might encounter, since you're using strings a lot!).
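To make the last two remarks concrete, here is one way the pairwise function could look with const references and std::vector in place of the raw arrays. It is a sketch that keeps the question's dynamic-programming approach and its quadratic running time; the name longestCommonSubstring is chosen for this example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch: same two-row dynamic programme as in the question, but the rows are
// std::vector objects, so nothing leaks and nothing has to be deleted by hand.
std::string longestCommonSubstring(const std::string& a, const std::string& b) {
    // previous[j] holds the length of the common prefix of a[i+1..] and b[j..].
    std::vector<std::size_t> previous(b.size() + 1, 0);
    std::vector<std::size_t> current(b.size() + 1, 0);

    std::size_t bestLength = 0;
    std::size_t bestStart = 0;

    // Walk both strings from the back, as the original code does.
    for (std::size_t i = a.size(); i-- > 0;) {
        for (std::size_t j = b.size(); j-- > 0;) {
            if (a[i] == b[j]) {
                current[j] = previous[j + 1] + 1;
                if (current[j] > bestLength) {
                    bestLength = current[j];
                    bestStart = i;
                }
            } else {
                current[j] = 0;
            }
        }
        std::swap(previous, current);  // swaps the buffers without copying elements
    }
    return a.substr(bestStart, bestLength);
}
```

Returning a.substr(...) also avoids writing a '\0' into a copy of the string, which is why the parameters can be const references here.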

You have to use a suffix tree. This structure yields an algorithm that runs in about 1 second for 10 strings of 10000 symbols each.

Give a suffix array a try. They take as much memory as your input strings (depending on your text encoding, though) and are built quickly in linear time.

http://en.wikipedia.org/wiki/Suffix_array

Here is my JavaScript code for this:

function LCS(as, bs, A, B) {
    var a = 0, b = 0, R = [], max = 1
    while (a < A.length && b < B.length) {
        var M = cmpAt(as, bs, A[a], B[b])
        if (M.size > 0) {
            if (M.ab < 0) {
                var x = b; while (x < B.length) {
                    var C = cmpAt(as, bs, A[a], B[x])
                    if (C.size >= M.size) {  if (C.size >= max) max = C.size, R.push([a, x, C.size]) } else break
                    x++
                }
            } else {
                var x = a; while (x < A.length) {
                    var C = cmpAt(as, bs, A[x], B[b])
                    if (C.size >= M.size) { if (C.size >= max) max = C.size, R.push([x, b, C.size]) } else break
                    x++
                }
            }
        }
        if (M.ab < 0) a++; else b++
    }
    R = R.filter(function(a){ if (a[2] == max) return true })
    return R
}

function cmpAt(a, b, x, y) {
    var c = 0
    while (true) {
        if (x == a.length) {
            if (y == b.length) return { size: c, ab: 0 }
            return { size: c, ab: -1 }
        }
        if (y == b.length) return { size: c, ab: 1 }
        if (a.charCodeAt(x) != b.charCodeAt(y)) {
            var ab = 1; 
            if (a.charCodeAt(x) < b.charCodeAt(y)) ab = -1
            return { size: c, ab: ab }
        }
        c++, x++, y++
    }
}
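
For comparison with the question's C++ code, here is a minimal sketch of the suffix-array idea for two strings. It is an illustration under stated assumptions: the suffix array is built with a plain comparison sort rather than a linear-time construction, the separator byte '\x01' is assumed not to occur in either input, and the function name lcsViaSuffixArray is made up for this example. The common-prefix lengths of neighbouring suffixes are computed with Kasai's algorithm.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Sketch: longest common substring of two strings via a suffix array.
// Assumption: neither input contains the separator byte '\x01'.
std::string lcsViaSuffixArray(const std::string& a, const std::string& b) {
    const std::string text = a + '\x01' + b;
    const int n = static_cast<int>(text.size());

    // Suffix array: start positions ordered by the suffix that begins there.
    // Naive comparison sort for brevity, NOT a linear-time construction.
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int x, int y) {
        return text.compare(x, std::string::npos, text, y, std::string::npos) < 0;
    });

    // rank[i] is the position of the suffix starting at i within sa.
    std::vector<int> rank(n);
    for (int r = 0; r < n; ++r) rank[sa[r]] = r;

    // Kasai's algorithm: lcp[r] is the length of the common prefix of the
    // suffixes starting at sa[r - 1] and sa[r].
    std::vector<int> lcp(n, 0);
    for (int i = 0, h = 0; i < n; ++i) {
        if (rank[i] == 0) { h = 0; continue; }
        const int j = sa[rank[i] - 1];
        while (i + h < n && j + h < n && text[i + h] == text[j + h]) ++h;
        lcp[rank[i]] = h;
        if (h > 0) --h;
    }

    // The answer is the largest lcp between neighbours in the suffix array
    // that start in different input strings.
    const int split = static_cast<int>(a.size());
    int bestLength = 0;
    int bestStart = 0;
    for (int r = 1; r < n; ++r) {
        const bool leftInA = sa[r - 1] < split;
        const bool rightInA = sa[r] < split;
        if (leftInA != rightInA && lcp[r] > bestLength) {
            bestLength = lcp[r];
            bestStart = sa[r];
        }
    }
    return text.substr(bestStart, bestLength);
}
```

For example, lcsViaSuffixArray("banana", "ananas") returns "anana". Handling all input strings at once would need a window over the suffix array that covers every string, which is left out of this sketch.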
