简体   繁体   中英

Runtime of KMP algorithm and LPS table construction

I recently came across the KMP algorithm, and I have spent a lot of time trying to understand why it works. While I do understand the basic functionality now, I simply fail to understand the runtime computations.

I have taken the below code from the geeksForGeeks site: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/

This site claims that if the text size is O(n) and pattern size is O(m), then KMP computes a match in max O(n) time. It also states that the LPS array can be computed in O(m) time.

// C++ program for implementation of KMP pattern searching 
// algorithm 
#include <bits/stdc++.h> 

void computeLPSArray(char* pat, int M, int* lps); 

// Prints occurrences of txt[] in pat[] 
void KMPSearch(char* pat, char* txt) 
{ 
    int M = strlen(pat); 
    int N = strlen(txt); 

    // create lps[] that will hold the longest prefix suffix 
    // values for pattern 
    int lps[M]; 

    // Preprocess the pattern (calculate lps[] array) 
    computeLPSArray(pat, M, lps); 

    int i = 0; // index for txt[] 
    int j = 0; // index for pat[] 
    while (i < N) { 
        if (pat[j] == txt[i]) { 
            j++; 
            i++; 
        } 

        if (j == M) { 
            printf("Found pattern at index %d ", i - j); 
            j = lps[j - 1]; 
        } 

        // mismatch after j matches 
        else if (i < N && pat[j] != txt[i]) { 
            // Do not match lps[0..lps[j-1]] characters, 
            // they will match anyway 
            if (j != 0) 
                j = lps[j - 1]; 
            else
                i = i + 1; 
        } 
    } 
}

// Fills lps[] for given patttern pat[0..M-1] 
void computeLPSArray(char* pat, int M, int* lps) 
{ 
    // length of the previous longest prefix suffix 
    int len = 0; 

    lps[0] = 0; // lps[0] is always 0 

    // the loop calculates lps[i] for i = 1 to M-1 
    int i = 1; 
    while (i < M) { 
        if (pat[i] == pat[len]) { 
            len++; 
            lps[i] = len; 
            i++; 
        } 
        else // (pat[i] != pat[len]) 
        { 
            // This is tricky. Consider the example. 
            // AAACAAAA and i = 7. The idea is similar 
            // to search step. 
            if (len != 0) { 
                len = lps[len - 1]; 

                // Also, note that we do not increment 
                // i here 
            } 
            else // if (len == 0) 
            { 
                lps[i] = 0; 
                i++; 
            } 
        } 
    } 
} 

// Driver program to test above function 
int main() 
{ 
    char txt[] = "ABABDABACDABABCABAB"; 
    char pat[] = "ABABCABAB"; 
    KMPSearch(pat, txt); 
    return 0; 
}

I am really confused why that is the case.

For LPS computation, consider: aaaaacaaac In this case, when we try to compute LPS for the first c, we would keep going back until we hit LPS[0], which is 0 and stop. So, essentially, we would travel back atleast the length of the pattern until that point. If this happens multiple times, how will time complexity be O(m)?

I have similar confusion on runtime of KMP to be O(n).

I have read other threads in stack overflow before posting, and also various other sites on the topic. I am still very confused. I would really appreciate if someone can help me understand the best and worse case scenarios for these algorithms and how their runtime is computed using some examples. Again, please don't suggest I google this, I have done it, spent a whole week trying to gain any insight, and failed.

One way to establish an upper bound on the runtime for construction of the LPS array is to consider a pathological case - how can we maximize the number of times we have to execute len = lps[len - 1]? Consider the following string, ignoring spaces: x1 x2 x1x3 x1x2x1x4 x1x2x1x3x1x2x1x5 ...

The second term needs to be compared to the first term as if it ended in 1 instead of 2, it would match the first term. Similarly the third term needs to be compared to the first two terms as if it ended in 1 or 2 instead of 3, it would match those partial terms. And so forth.

In the example string, it is clear that only every 1/2^n characters can match n times, so the total runtime will be m+m/2+m/4+..=2m=O(m), the length of the pattern string. I suspect it's impossible to construct a string with worse runtime than the example string and this can probably be formally proven.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM