简体   繁体   中英

DFA construction in Knuth-Morris-Pratt algorithm

I am referring to the outline of the Knuth-Morris-Pratt (KMP) algorithm for substring search in Sedgewick's book "Algorithms" (4th ed.).

The KMP algorithm uses a backup in substring search based on a deterministic finite automaton (DFA). I understand how the DFA enters the algorithm, however I don't understand how to construct the DFA, which is done by the following code snippet:

dfa[pat.charAt(0)][0] = 1;
for (int X = 0; j = 1; j< M; j++) {
    for (int c = 0; c < R; c++) {
       dfa[c][j] = dfa[c][X];
    }
    dfa[pat.charAt(j)][j] = j+1;
    X = dfa[pat.charAt(j)][X];
}

where M is the the length of the pattern pat and R the size of the alphabet. The charAt() function returns the integer value of the character at the respective position.

Can someone explain in what way this piece of code constructs a dfa? I am lost in the actual intuitive meaning of the inner for-loop.

Let us take a look at the following FA for pattern ACACAGA.

在此输入图像描述

在此输入图像描述

The above diagrams represent graphical and tabular representations of pattern ACACAGA.

Here, number of states in DFA is M + 1 where M is length of the pattern. The main thing to construct DFA is to get the next state from the current state for every possible character. Given a character x and a state k, we can get the next state by considering the string “pat[0..k-1]x” which is basically concatenation of pattern characters pat[0], pat 1 … pat[k-1] and the character x. The idea is to get length of the longest prefix of the given pattern such that the prefix is also suffix (LPS) of “pat[0..k-1]x”. The value of length gives us the next state.

For example, let us see how to get the next state from current state 5 and character 'C' in the above diagram. We need to consider the string, “pat[0..5]C” which is “ACACAC”. The lenght of the longest prefix of the pattern such that the prefix is suffix of “ACACAC”is 4 (“ACAC”). So the next state (from state 5) is 4 for character 'C'.

// dfa[i][j] = k denotes the transition function will go k'th state 
// with character i from state j

// DFA will go always (i + 1)'th state from i'th state 
//if the character c is equal to current character of pattern 
dfa[pat.charAt(0)][0] = 1;

//  here X is initialized with LPS (longest prefix suffix) 
// of pattern[0....j - 1]. LPS[0] is always 0
for (int X = 0; j = 1; j< M; j++) {
    for (int c = 0; c < R; c++) { // for all possible characters
        // transition function from j'th state taking character c 
        // will go to the same state where it went from X(LPS)'th state
        // taking character c (justify it with an example) 
        dfa[c][j] = dfa[c][X];
    }
    // DFA will go always (i + 1)th state from i'th state if 
    // the character c is equal to current character of pattern
    dfa[pat.charAt(j)][j] = j + 1;
    X = dfa[pat.charAt(j)][X]; // update LPS
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM