Implementing Longest Common Substring using Suffix Array

I am using this program to compute the suffix array and the longest common prefix (LCP) array.

I need to find the longest common substring of two strings.

To do that, I concatenate the strings as A#B and then use this algorithm.

I have the suffix array sa[] and the LCP[] array.

The length of the longest common substring is the maximum value of LCP[], taken over adjacent suffix array entries where one suffix starts in A and the other starts in B.

To find the substring itself, the only condition is that among common substrings of the same (maximum) length, the one that occurs first in string B should be the answer.
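To pin down that tie-breaking rule, here is a minimal brute-force sketch that can serve as a reference on small inputs. The function name first_lcs_bruteforce is made up for illustration and is not part of the linked program; it runs in roughly O(|A| * |B| * L) time, so it is only meant for checking a faster solution.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical reference implementation (brute force, for small inputs only).
// Returns the longest substring of b that also occurs in a; among substrings
// of equal length, the one with the smallest start index in b wins.
std::string first_lcs_bruteforce(const std::string& a, const std::string& b) {
    std::size_t best_len = 0, best_start = 0;
    for (std::size_t j = 0; j < b.size(); ++j) {        // start in B, left to right
        for (std::size_t i = 0; i < a.size(); ++i) {    // start in A
            std::size_t k = 0;
            while (i + k < a.size() && j + k < b.size() && a[i + k] == b[j + k]) ++k;
            if (k > best_len) {   // strict '>' keeps the earliest start in B on ties
                best_len = k;
                best_start = j;
            }
        }
    }
    assert(best_start + best_len <= b.size());  // sanity check on the result range
    return b.substr(best_start, best_len);
}
```

For example, first_lcs_bruteforce("abcde", "xbcdy") returns "bcd", and two strings with no character in common give an empty string.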

For that, I maintain the maximum of LCP[]. If LCP[curr_index] == max, I update left_index only when the start index of the substring in B is smaller than the previous value of left_index.

However, this approach does not give the right answer. Where is the fault?

max=-1;
int n=strlen(S);   // computed once; strlen in the loop condition rescans S every iteration
for(int i=1;i<n-1;++i)
{
    // sa[i] starts in A and sa[i+1] starts in B (index l1 holds the '#' separator)
    if(lcp[i] >= max && sa[i] < l1 && sa[i+1] >= l1+1 )
    {
        if( max == lcp[i] && sa[i+1] < left_index ) left_index=sa[i+1];

        else if (lcp[i] > max )
        {
            left_index=sa[i+1];
            max=lcp[i];
        }
    }
    // sa[i] starts in B and sa[i+1] starts in A
    else if (lcp[i] >= max && sa[i] >= l1+1 && sa[i+1] < l1 )
    {
        if( max == lcp[i] && sa[i] < left_index) left_index=sa[i];

        else if (lcp[i] > max)
        {
            left_index=sa[i];
            max=lcp[i];
        }
    }
}

AFAIK, this problem is from a programming contest, and discussing problems from an ongoing contest before the editorials have been released shouldn't be .... Still, I am giving you some insights, because I also got Wrong Answer with a suffix array. Then I used a suffix automaton, which got me Accepted.

The usual contest suffix array construction (prefix doubling with sorting) runs in O(n log^2 n), whereas a suffix automaton can be built in O(n) for a fixed alphabet. So my advice is to go with a suffix automaton, and you will surely get Accepted. And if you can code a solution for that problem, you will surely be able to code this one.

I also found this on the CodeChef forum:

Try this case:
babaazzzzyy
badyybac
The suffix array will contain baa... (from the first string), baba... (from the first string), bac (from the second string), and bad... (from the second string).
So if you examine only consecutive entries of the SA, you will find a match between "baba..." and "bac" and get the index of "ba" as 6 (1-based) in the second string, even though it also occurs at index 1.
It is likely that you will output "yy" instead of "ba".
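If you want to stay with the suffix array, one possible fix is to stop looking only at adjacent pairs when choosing the position. First take the maximum cross-string LCP as before, then scan every maximal run of consecutive SA entries whose connecting LCP values are all at least that maximum. Every suffix in such a run shares the same prefix, so a run that contains at least one suffix from A can use the smallest B start anywhere in the run. The sketch below is only an illustration under a few assumptions: the names build_sa, build_lcp and first_lcs are made up, the suffix array is built by a plain comparison sort rather than by the linked program, lcp[i] compares sa[i] with sa[i+1], and neither input contains '#'.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array (comparison sort of suffixes); slow but easy to verify.
std::vector<int> build_sa(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int x, int y) {
        return s.compare(x, std::string::npos, s, y, std::string::npos) < 0;
    });
    return sa;
}

// Kasai-style LCP: lcp[i] = common prefix length of suffixes sa[i] and sa[i+1].
std::vector<int> build_lcp(const std::string& s, const std::vector<int>& sa) {
    const int n = static_cast<int>(s.size());
    std::vector<int> rank(n), lcp(n > 0 ? n - 1 : 0);
    for (int i = 0; i < n; ++i) rank[sa[i]] = i;
    for (int i = 0, h = 0; i < n; ++i) {
        if (rank[i] + 1 == n) { h = 0; continue; }   // last entry has no successor
        const int j = sa[rank[i] + 1];
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h;
        lcp[rank[i]] = h;
        if (h > 0) --h;
    }
    return lcp;
}

// Longest common substring of a and b; ties go to the earliest start in b.
std::string first_lcs(const std::string& a, const std::string& b) {
    assert(a.find('#') == std::string::npos && b.find('#') == std::string::npos);
    const std::string s = a + '#' + b;
    const int n = static_cast<int>(s.size()), l1 = static_cast<int>(a.size());
    const std::vector<int> sa = build_sa(s), lcp = build_lcp(s, sa);

    int best = 0;   // maximum LCP over adjacent entries from different strings
    for (int i = 0; i + 1 < n; ++i)
        if ((sa[i] < l1) != (sa[i + 1] < l1)) best = std::max(best, lcp[i]);
    if (best == 0) return "";

    int left = n;   // smallest start (in s) of a B suffix that matches some A suffix
    for (int i = 0; i < n;) {
        int j = i;
        while (j + 1 < n && lcp[j] >= best) ++j;     // run sa[i..j] shares >= best chars
        bool has_a = false;
        int min_b = n;
        for (int k = i; k <= j; ++k) {
            if (sa[k] < l1) has_a = true;
            else if (sa[k] > l1) min_b = std::min(min_b, sa[k]);
        }
        if (has_a) left = std::min(left, min_b);
        i = j + 1;
    }
    return s.substr(left, best);
}
```

On the test case above, first_lcs("babaazzzzyy", "badyybac") returns "ba": the run baa..., baba..., bac, bad... is treated as a whole, so the B suffix "bad..." at the start of the second string is picked instead of "bac".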

Handling the constraint "...the first longest common substring to be found on the second string, should be written to output..." is also very easy with a suffix automaton. Best of luck!
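A rough sketch of that suffix automaton idea, not the code I submitted: build the automaton for the first string, feed the second string through it character by character, and keep track of the current match length. Updating the answer only on a strictly greater length means the first position in the second string that reaches the maximum is kept. The names SuffixAutomaton and first_lcs_automaton are made up for this example, and std::map transitions are used for brevity, which adds a log factor on top of the linear construction.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct SuffixAutomaton {
    struct State {
        int len = 0;                 // length of the longest string in this state
        int link = -1;               // suffix link
        std::map<char, int> next;    // transitions
    };
    std::vector<State> st;
    int last = 0;

    explicit SuffixAutomaton(const std::string& s) {
        st.emplace_back();           // state 0 is the root
        for (char c : s) extend(c);
    }

    void extend(char c) {
        const int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            const int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                const int clone = static_cast<int>(st.size());
                State copy = st[q];              // clone keeps q's transitions and link
                copy.len = st[p].len + 1;
                st.push_back(copy);
                while (p != -1 && st[p].next.count(c) && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Longest common substring of a and b; ties go to the earliest occurrence in b.
std::string first_lcs_automaton(const std::string& a, const std::string& b) {
    const SuffixAutomaton sam(a);
    int v = 0, len = 0, best = 0, best_end = 0;
    for (int i = 0; i < static_cast<int>(b.size()); ++i) {
        const char c = b[i];
        // Shorten the current match until it can be extended by c (or we hit the root).
        while (v != 0 && !sam.st[v].next.count(c)) {
            v = sam.st[v].link;
            len = sam.st[v].len;
        }
        const auto it = sam.st[v].next.find(c);
        if (it != sam.st[v].next.end()) { v = it->second; ++len; }
        else { len = 0; }                        // v is the root here
        if (len > best) { best = len; best_end = i + 1; }   // strict '>' keeps the first hit
    }
    assert(best_end >= best);
    return b.substr(best_end - best, best);
}
```

With the CodeChef case, first_lcs_automaton("babaazzzzyy", "badyybac") returns "ba": the length 2 is reached after the first two characters of the second string, and the later "yy" match of the same length does not replace it.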
