
String matching with finite automata

I am reading about string algorithms in Cormen's book "Introduction to Algorithms". My question concerns the algorithm for computing the transition function, which is shown below.

My question: why do we take min(m+1, q+2)? Why is m incremented by 1 and q by 2?

The following link has the background to this question.

http://people.scs.carleton.ca/~maheshwa/courses/5703COMP/Fall2009/StringMatching.pdf

Kindly help here with a simple example.

Algorithm Compute-Transition-Function(P, Sigma)
m = length(P);
for  q = 0 through m  do
   for each character  x  in Sigma
       k = min(m+1, q+2);
       repeat  k = k-1  // work backwards from q+1
       until  Pk 'is-suffix-of' Pqx;
       d(q, x) = k; // assign transition table
   end for;
end for;

return  d;
End algorithm.
  • It is m + 1 because the repeat loop that follows decrements k before it tests anything, so the first value actually tested is m.
  • It is q + 2 because, after that first decrement, the repeat loop starts testing at k = q + 1: having read one more character, the match can be at most one character longer than q.

The following code might have a boundary problem (the case q == m is missing), but it tries to make the indexing a bit clearer.

m = length(P);
for  q = 0 through m - 1 do // Loop through substrings [0, q+1]
   for each character  x  in Sigma
       k = q+1;
       // work backwards from q+1
       while not (Pk 'is-suffix-of' Pqx) do
           k = k-1;
       end while;
       d(q, x) = k; // assign transition table
   end for;
end for;

return  d;
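
As a rough runnable counterpart to the pseudocode above, here is a minimal Python sketch. It is not taken from the book: the name compute_transition_function, the list-of-dicts layout of d, and the use of min(m, q + 1) to cover the q == m case are all choices made for this illustration.

```python
def compute_transition_function(pattern, alphabet):
    """Return d, where d[q][x] is the state reached from state q on character x."""
    m = len(pattern)
    d = [{} for _ in range(m + 1)]
    for q in range(m + 1):              # states 0..m, including the full match
        for x in alphabet:
            seen = pattern[:q] + x      # P_q followed by x
            k = min(m, q + 1)           # largest state that could be reached
            # Work backwards until P_k is a suffix of P_q x.
            # k = 0 always succeeds, because the empty string is a suffix of anything.
            while not seen.endswith(pattern[:k]):
                k -= 1
            d[q][x] = k
    return d


d = compute_transition_function("nano", "nao")
print(d[3])   # transitions out of the state for "nan"
```

Because the while loop tests before it decrements, k starts at the largest possible answer rather than one above it.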

The code has been explained, so here is an example to see what's going on.

Let's say the pattern is "nano".

So we want our states to be partial matches to the pattern. The possible partial matches to "nano" are "", "n", "na", "nan", or (the complete match) "nano" itself. In other words, they're just the prefixes of the pattern. In general, if the pattern has m characters, we need m+1 states; here m=4 and there are five states.

If we've just seen "...nan" and see another character x, what state should we go to? Clearly, if x is the next character in the match (here "o"), we should go to the next longer prefix (here "nano"). And clearly, once we've seen a complete match, we just stay in that state. But suppose we see a different character, such as "a"? That means that the string so far looks like "...nana". The longest partial match we could be in is just "na", i.e. we can reuse the last 2 characters. So from state "nan", we should draw an arrow labeled "a" to state "na". Note that "na" is a prefix of "nano" (so it's a state) and a suffix of "nana" (so it's a partial match consistent with what we've just seen).

In general, the transition from state+character leads to the longest string that's simultaneously a prefix of the original pattern and a suffix of the state+character we've just seen. This is enough to tell us what all the transitions should be. If we're looking for the pattern "nano", the transition table would be

          n       a       o       other
          ---     ---     ---     -----
empty:    "n"     empty   empty   empty
"n":      "n"     "na"    empty   empty
"na":     "nan"   empty   empty   empty
"nan":    "n"     "na"    "nano"  empty
"nano":   "nano"  "nano"  "nano"  "nano"

Just as an illustration: "nan" + n = "n" because we can only use the last 'n', and "nan" + a = "na" because now we can use the last two characters, 'na'.

So now how do we use this table to actually do pattern searching?

Simulating this on the string "banananona", we get the sequence of states empty, empty, "n", "na", "nan", "na", "nan", "nano", "nano", "nano" by moving over one character at a time. Since we end in state "nano", this string contains "nano" in it somewhere.

Let's expand on what's going on and how to use the table above. At 'b', we're in none of the states "n", "na", "nan", "nano", so it counts as empty. The same holds when we get to 'ba'. When we hit the next character, 'n', we are going from empty by adding n, so we look it up in the table and see that we end in state "n". Now we get to the 4th character of banananona, so we go from "n" by adding a. Again we use the table and see that we end up in state "na", and so on.
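
The walk over "banananona" can be reproduced with a few lines of Python. This is only a sketch: the dictionary TABLE is the "nano" transition table from this answer typed in by hand, and the names TABLE, trace and the "other" key are made up for the illustration.

```python
# Transition table for the pattern "nano", copied from the answer.
# "other" covers every character that is not n, a or o.
TABLE = {
    "":     {"n": "n",    "a": "",     "o": "",     "other": ""},
    "n":    {"n": "n",    "a": "na",   "o": "",     "other": ""},
    "na":   {"n": "nan",  "a": "",     "o": "",     "other": ""},
    "nan":  {"n": "n",    "a": "na",   "o": "nano", "other": ""},
    "nano": {"n": "nano", "a": "nano", "o": "nano", "other": "nano"},
}


def trace(text, table=TABLE):
    """Return the state reached after each character of text."""
    state = ""                              # start in the empty state
    states = []
    for ch in text:
        row = table[state]
        state = row.get(ch, row["other"])   # one table lookup per character
        states.append(state)
    return states


states = trace("banananona")
print(states)
print("match" if states[-1] == "nano" else "no match")
```

Checking only the last state works here because the "nano" row never leaves the full-match state. With a table that does leave it, the loop would have to check for the full-match state after every character.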

The entry d(q,x) in the transition table contains the length of the longest matched prefix of the pattern after consuming the character x, given that before consuming x the longest matched prefix was q characters long. Since we consume one letter, it cannot be larger than q+1, and since the pattern has length m, it can also be at most m. The inner loop is repeat k = k-1 until condition(k), so k is decremented before anything is tested. Thus k must start 1 larger than the largest possible result: k = min(m,q+1) + 1, which is the same as min(m+1, q+2). If the inner loop were while negated_condition(k) { k = k-1; }, one would start with k = min(m,q+1).
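
The off-by-one argument can be checked mechanically. The sketch below is an illustration only (is_suffix, delta_repeat and delta_while are made-up names). Python has no repeat/until, so that loop is emulated with while True and an early return.

```python
def is_suffix(pattern, k, q, x):
    """True when P_k (the first k characters) is a suffix of P_q followed by x."""
    return (pattern[:q] + x).endswith(pattern[:k])


def delta_repeat(pattern, q, x):
    """repeat k = k-1 until ...: decrement first, then test."""
    m = len(pattern)
    k = min(m + 1, q + 2)          # one above the largest possible answer
    while True:
        k -= 1
        if is_suffix(pattern, k, q, x):
            return k


def delta_while(pattern, q, x):
    """while not ...: test first, then decrement."""
    m = len(pattern)
    k = min(m, q + 1)              # the largest possible answer itself
    while not is_suffix(pattern, k, q, x):
        k -= 1
    return k


pattern = "nano"
same = all(
    delta_repeat(pattern, q, x) == delta_while(pattern, q, x)
    for q in range(len(pattern) + 1)
    for x in "naob"
)
print(same)
```

Both versions test the same sequence of k values, namely min(m, q+1), then one less, and so on. Only the starting value differs, to compensate for where the decrement happens.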

Note that the transition table can be computed much more efficiently by using the borders table from the Knuth-Morris-Pratt algorithm.
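
One way this could look, as a sketch rather than a definitive implementation: border[q] is taken to be the length of the longest proper prefix of the first q pattern characters that is also a suffix of them, and on a mismatch in state q the transition is copied from the already filled row border[q]. The function names and the list-of-dicts layout are assumptions made for this example.

```python
def border_table(pattern):
    """border[q] = length of the longest proper border of pattern[:q]."""
    m = len(pattern)
    border = [0] * (m + 1)
    k = 0                                  # border length of the previous prefix
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = border[k]                  # fall back to the next shorter border
        if pattern[k] == pattern[q - 1]:
            k += 1
        border[q] = k
    return border


def transition_table(pattern, alphabet):
    """Fill d[q][x] in O(m * |alphabet|) time after the border table is built."""
    m = len(pattern)
    border = border_table(pattern)
    d = [{} for _ in range(m + 1)]
    for q in range(m + 1):
        for x in alphabet:
            if q < m and pattern[q] == x:
                d[q][x] = q + 1            # x extends the current match
            elif q > 0:
                d[q][x] = d[border[q]][x]  # row border[q] < q is already filled
            else:
                d[q][x] = 0                # mismatch in the empty state
    return d


print(border_table("nano"))
print(transition_table("nano", "nao")[3])
```

Each table entry is filled in constant time, whereas the repeat loop re-tests suffixes from scratch for every (q, x) pair.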
