
String matching with finite automata

I am reading about string algorithms in Cormen's book "Introduction to Algorithms", specifically the transition function computation shown below.

My question: why do we compute min(m+1, q+2), that is, why is m incremented by 1 and q by 2?

The following link has the background to this question.

http://people.scs.carleton.ca/~maheshwa/courses/5703COMP/Fall2009/StringMatching.pdf

Kindly help here with a simple example.

Algorithm Compute-Transition-Function(P, Sigma)
m = length(P);
for  q = 0 through m  do
   for each character  x  in Sigma
       k = min(m+1, q+2);
       repeat  k = k-1  // work backwards from q+1
       until  Pk 'is-suffix-of' Pqx;
       d(q, x) = k; // assign transition table
   end for;
end for;

return  d;
End algorithm.
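  • It is m + 1 because the repeat loop decrements k before the first test, so the first value actually tested is m.
  • It is q + 2 for the same reason: after the first decrement the loop starts testing at k = q + 1, the longest match possible after reading one more character.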

The following variant uses a while loop instead, so that k starts directly at the largest possible value; it is meant to make the indexing a bit clearer.

m = length(P);
for  q = 0 through m  do // state q: the first q characters of P have matched
   for each character  x  in Sigma
       k = min(m, q+1);
       // work backwards from the largest candidate
       while not Pk 'is-suffix-of' Pqx
       do k = k-1; end do;
       d(q, x) = k; // assign transition table
   end for;
end for;

return  d;

The code has been explained, so here is an example to see what's going on.

Let's say the pattern is "nano".

So we want our states to be partial matches to the pattern. The possible partial matches to "nano" are "", "n", "na", "nan", or (the complete match) "nano" itself. In other words, they're just the prefixes of the string. In general, if the pattern has m characters, we need m+1 states; here m=4 and there are five states.

If we've just seen "...nan" and see another character "x", what state should we go to? Clearly, if x is the next character in the match (here "o"), we should go to the next longer prefix (here "nano"). And in this example, once we've seen a complete match, we just stay in that state. But suppose we see a different character, such as "a"? That means that the string so far looks like "...nana". The longest partial match we could be in is just "na", i.e. we can use the last 2 characters. So from state "nan", we should draw an arrow labeled "a" to state "na". Note that "na" is a prefix of "nano" (so it's a state) and a suffix of "nana" (so it's a partial match consistent with what we've just seen).

In general, the transition from state+character to state is the longest string that is simultaneously a prefix of the original pattern and a suffix of the state+character we've just seen. This is enough to tell us what all the transitions should be. If we're looking for the pattern "nano", the transition table would be

state     n        a        o        other
------    ------   ------   ------   ------
empty     "n"      empty    empty    empty
"n"       "n"      "na"     empty    empty
"na"      "nan"    empty    empty    empty
"nan"     "n"      "na"     "nano"   empty
"nano"    "nano"   "nano"   "nano"   "nano"

As an illustration of the "nan" row: nan + n = n because we can only use the last 'n', and nan + a = na because now we can use the last two characters 'na'.
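The table can be reproduced mechanically from the rule above. The following is a minimal Python sketch, not the textbook algorithm: the function names `longest_prefix_suffix` and `build_table` are made up for illustration, the character "?" stands in for the "other" column, and the absorbing final state follows this example's convention rather than Cormen's pseudocode.

```python
def longest_prefix_suffix(pattern: str, text: str) -> str:
    """Return the longest prefix of `pattern` that is also a suffix of `text`."""
    for k in range(min(len(pattern), len(text)), -1, -1):
        if text.endswith(pattern[:k]):
            return pattern[:k]
    return ""  # not reached: the empty prefix (k == 0) always matches


def build_table(pattern: str, alphabet: str) -> dict:
    """Map (state, character) to the next state; states are prefixes of `pattern`."""
    table = {}
    for q in range(len(pattern) + 1):
        state = pattern[:q]
        for ch in alphabet:
            if state == pattern:
                # Assumption taken from this example: a complete match is absorbing.
                table[(state, ch)] = pattern
            else:
                table[(state, ch)] = longest_prefix_suffix(pattern, state + ch)
    return table


# "?" is a stand-in for any character that does not occur in the pattern.
table = build_table("nano", "nao?")
for state in ["", "n", "na", "nan", "nano"]:
    print(repr(state), [table[(state, ch)] for ch in "nao?"])
```

Each printed row should correspond to one row of the table above, with the empty string playing the role of "empty".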

So now, how do we use this table to actually do pattern searching?

Simulating this on the string "banananona", we get the sequence of states empty, empty, "n", "na", "nan", "na", "nan", "nano", "nano", "nano" by moving over one character at a time. Since we end in state "nano", this string contains "nano" somewhere.

Let's expand on what is going on and how the table above is used. At 'b' we are in none of the states "n", "na", "nan", "nano", so it counts as empty; the same holds after "ba". When we hit the next character 'n', we go from empty on 'n', so we look this up in the table and see that we end in state "n". Then we reach the 4th character of "banananona", an 'a', so we go from "n" on 'a'; again we use the table and see that we end up in state "na", and so on.
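The walk-through above can be sketched in a few lines of Python. `TABLE` is a hand transcription of the table for "nano" given earlier, and `run` is a hypothetical helper name; characters that have no column of their own are treated as "other".

```python
# Transition table for the pattern "nano"; "" is the empty state.
TABLE = {
    "":     {"n": "n",    "a": "",     "o": ""},
    "n":    {"n": "n",    "a": "na",   "o": ""},
    "na":   {"n": "nan",  "a": "",     "o": ""},
    "nan":  {"n": "n",    "a": "na",   "o": "nano"},
    "nano": {"n": "nano", "a": "nano", "o": "nano"},
}


def run(text, table, accept="nano"):
    """Feed `text` to the automaton and return the state after each character."""
    state = ""
    trace = []
    for ch in text:
        # "other" column: empty, except that the final state stays where it is.
        default = accept if state == accept else ""
        state = table[state].get(ch, default)
        trace.append(state)
    return trace


trace = run("banananona", TABLE)
print(trace)
print("match found:", trace[-1] == "nano")
```

Because the final state is absorbing in this example, checking only the last state is enough to decide whether the pattern occurred anywhere in the text.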

The entry d(q,x) in the transition table contains the length of the longest matched prefix of the pattern after consuming the character x, given that before consuming x the longest matched prefix was q characters long. Since we consume one letter, it cannot be larger than q+1, and since the pattern has length m, it can also be at most m. The inner loop is repeat k = k-1 until condition(k), so k is decremented before anything is tested; thus k must start 1 larger than the largest possible result, k = min(m,q+1) + 1 = min(m+1, q+2). If the inner loop were while negated_condition(k) { k = k-1; }, one would start with k = min(m,q+1).
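To see the bound at work, here is a rough Python transcription of the pseudocode from the question. It is a sketch under a few assumptions: states are the integers 0..m, the dictionary layout and the name `compute_transition_function` are choices made here, and the repeat/until loop is emulated with `while True` and `break`.

```python
def compute_transition_function(pattern, alphabet):
    """delta[(q, x)] = length of the longest prefix of pattern that is a suffix of pattern[:q] + x."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for x in alphabet:
            seen = pattern[:q] + x       # P_q x in the pseudocode
            k = min(m + 1, q + 2)        # one more than the largest possible result
            while True:                  # repeat ... until
                k -= 1                   # decrement first, so testing starts at min(m, q+1)
                if seen.endswith(pattern[:k]):
                    break                # k == 0 always succeeds, so the loop terminates
            delta[(q, x)] = k
    return delta


delta = compute_transition_function("nano", "nao")
print(delta[(3, "a")])   # state "nan" on 'a'
print(delta[(4, "n")])   # state "nano" on 'n'
```

Unlike the "nano" table in the example above, this version does not stay in the final state: from state 4 it falls back to the longest matching prefix, which lets the automaton go on to find later occurrences.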

Note that the transition table can be computed much more efficiently by using the borders table for the Knuth-Morris-Pratt algorithm.
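One way this could look, as a sketch rather than a reference implementation: compute the borders (failure) table first, then fill each row of the transition table by reusing an earlier row. The names `border_table` and `transition_table_from_borders` are illustrative, and the convention that `border[q]` is the length of the longest proper border of the first q pattern characters is an assumption of this sketch.

```python
def border_table(pattern):
    """border[q] = length of the longest proper border of pattern[:q]."""
    m = len(pattern)
    border = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # Shrink the current border until it can be extended by pattern[q-1].
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = border[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        border[q] = k
    return border


def transition_table_from_borders(pattern, alphabet):
    """delta[q][x] = next state (matched prefix length) from state q on character x."""
    m = len(pattern)
    border = border_table(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for x in alphabet:
            if q < m and pattern[q] == x:
                delta[q][x] = q + 1                  # the match is extended
            elif q == 0:
                delta[q][x] = 0                      # nothing matched, nothing to fall back to
            else:
                delta[q][x] = delta[border[q]][x]    # border[q] < q, so that row is already filled
    return delta


delta = transition_table_from_borders("nano", "nao")
print(delta[3])
```

With this approach every table entry is filled in constant time once the borders are known, instead of testing several candidate suffixes per entry as the pseudocode in the question does.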
