
What's the worst case complexity for KMP when the goal is to find all occurrences of a certain string?

I would also like to know which algorithm has the worst case complexity of all for finding all occurrences of a string in another. It seems that the Boyer–Moore algorithm has linear time complexity.

The KMP algorithm has linear complexity for finding all occurrences of a pattern in a string, like the Boyer-Moore algorithm¹. If you try to find a pattern like "aaaaaa" in a string like "aaaaaaaaa", once you have the first complete match,

aaaaaaaaa
aaaaaa
 aaaaaa
      ^

the border table tells you that the next longest possible match of a prefix of the pattern (corresponding to the widest border of the pattern) is just one character short. In this respect, a complete match is equivalent to a mismatch one past the end of the pattern. Thus the pattern is moved one place further, and since the border table shows that all characters of the pattern except possibly the last already match, the next comparison is between the last pattern character and the text character aligned with it. In this particular case (finding occurrences of a^m in a^n), which is the worst case for the naive matching algorithm, the KMP algorithm compares each text character exactly once.

In each step, at least one of

  • the position of the text character compared
  • the position of the first character of the pattern with respect to the text

increases, and neither ever decreases. The position of the text character compared can increase at most length(text)-1 times, the position of the first pattern character can increase at most length(text) - length(pattern) times, so the algorithm takes at most 2*length(text) - length(pattern) - 1 steps.
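The counting argument above can be checked with a small experiment. What follows is only a sketch: the names border_table and kmp_find_all are mine, and the comparison counter uses my own convention (one count per character comparison), so it is not the same "step" accounting as in the bound above. It just shows that the count stays linear in the text length.

```python
def border_table(pattern):
    """border[i] = width of the widest border of pattern[:i]; border[0] = -1."""
    border = [-1] * (len(pattern) + 1)
    k = -1
    for i, ch in enumerate(pattern):
        while k >= 0 and pattern[k] != ch:
            k = border[k]          # fall back to the next narrower border
        k += 1
        border[i + 1] = k
    return border


def kmp_find_all(pattern, text):
    """Return (start positions of all matches, number of character comparisons)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    border = border_table(pattern)
    matches, comparisons = [], 0
    j = 0                          # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j >= 0:
            comparisons += 1
            if pattern[j] == ch:
                break
            j = border[j]          # shift the pattern; the text position stays put
        j += 1
        if j == len(pattern):      # complete match ending at text position i
            matches.append(i - j + 1)
            j = border[j]          # treat it like a mismatch one past the end
    return matches, comparisons


print(kmp_find_all("a" * 6, "a" * 9))   # ([0, 1, 2, 3], 9)
```

For a^6 in a^9 each of the 9 text characters is compared exactly once. In general, under this counting convention, the number of comparisons never exceeds 2*length(text).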

The preprocessing (construction of the border table) takes at most 2*length(pattern) steps. Adding this to the search bound gives 2*m + (2*n - m - 1) = m + 2*n - 1, so the overall complexity is O(m+n) and no more than m + 2*n steps are executed, where m is the length of the pattern and n the length of the text.
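To make the preprocessing bound concrete, here is a sketch of one common way to build the border table, instrumented with a comparison counter. The function name and the convention border[0] = -1 are my own choices, and I count character comparisons rather than "steps", so treat the numbers as illustrative.

```python
def border_table_with_count(pattern):
    """Return (border table, number of character comparisons used to build it).

    border[i] is the width of the widest border of pattern[:i]; border[0] = -1.
    """
    border = [-1] * (len(pattern) + 1)
    comparisons = 0
    k = -1
    for i, ch in enumerate(pattern):
        while k >= 0:
            comparisons += 1
            if pattern[k] == ch:
                break
            k = border[k]          # try the next narrower border
        k += 1
        border[i + 1] = k
    return border, comparisons


for p in ["aaaaaa", "aaab", "abcabcab"]:
    table, count = border_table_with_count(p)
    print(p, table, count, "<=", 2 * len(p))
```

The counter can never exceed 2*length(pattern): k grows by exactly one per pattern character, and every failed comparison makes k smaller, so failures cannot outnumber the increments.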

¹ Note that the Boyer-Moore algorithm as commonly presented has a worst-case complexity of O(m*n) for periodic patterns and texts like a^m and a^n if all matches are required, because after a complete match,

aaaaaaaaa
aaaaaa
 aaaaaa
      ^
  <- <-
 ^

the entire pattern would be re-compared. To avoid that, you need to remember how long a prefix of the pattern still matches after the shift following a complete match, and compare only the new characters.
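The effect of remembering the matched prefix can be seen in a stripped-down experiment. Note that this is not Boyer-Moore: it only imitates the right-to-left comparison order and the shift after a complete match, and on a mismatch it simply moves one place (no bad-character or good-suffix tables). All names are hypothetical, and widest_border is deliberately naive.

```python
def widest_border(pattern):
    """Length of the longest proper prefix of pattern that is also a suffix."""
    m = len(pattern)
    for k in range(m - 1, 0, -1):
        if pattern[:k] == pattern[m - k:]:
            return k
    return 0


def right_to_left_scan(pattern, text, remember_prefix):
    """Return (match positions, character comparisons), comparing right to left."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    period = m - widest_border(pattern)   # safe shift after a complete match
    matches, comparisons = [], 0
    pos = 0      # text position aligned with pattern[0]
    known = 0    # pattern[:known] is already known to match at this alignment
    while pos + m <= n:
        j = m - 1
        while j >= known:
            comparisons += 1
            if pattern[j] != text[pos + j]:
                break
            j -= 1
        if j < known:                      # complete match
            matches.append(pos)
            pos += period
            known = m - period if remember_prefix else 0
        else:                              # mismatch: simplified shift by one
            pos += 1
            known = 0
    return matches, comparisons


print(right_to_left_scan("a" * 6, "a" * 9, remember_prefix=False))  # ([0, 1, 2, 3], 24)
print(right_to_left_scan("a" * 6, "a" * 9, remember_prefix=True))   # ([0, 1, 2, 3], 9)
```

Without the remembered prefix, every one of the 4 alignments re-compares all 6 pattern characters. With it, only the one new character is compared after each shift.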

There is a long article on KMP at http://en.wikipedia.org/wiki/Knuth-morris-pratt which ends by saying

Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the overall algorithm is O(n + k).

These complexities are the same, no matter how many repetitive patterns are in W or S. (end quote)

So the total cost of a KMP search is linear in the number of characters of string and pattern. I think this holds even if you need to find multiple occurrences of the pattern in the string. If not, just consider searching for patternQ, where Q is a character that does not occur in the text, and noting down where the KMP state shows that it has matched everything up to the Q.
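Here is roughly what the patternQ idea looks like in code. This is a sketch under assumptions: I use "\0" as Q and reject inputs that contain it, the function name is made up, and the failure table is the usual border table with border[0] = -1.

```python
def find_all_via_sentinel(pattern, text, sentinel="\0"):
    """Search for pattern + sentinel; report every point where all of pattern matched."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if sentinel in pattern or sentinel in text:
        raise ValueError("sentinel must not occur in pattern or text")
    extended = pattern + sentinel

    # Border table of the extended pattern.
    border = [-1] * (len(extended) + 1)
    k = -1
    for i, ch in enumerate(extended):
        while k >= 0 and extended[k] != ch:
            k = border[k]
        k += 1
        border[i + 1] = k

    matches = []
    j = 0                                   # KMP state: characters of extended matched
    for i, ch in enumerate(text):
        while j >= 0 and extended[j] != ch:
            j = border[j]
        j += 1
        if j == len(pattern):               # everything up to Q has matched
            matches.append(i - j + 1)
    return matches


print(find_all_via_sentinel("abab", "abababab"))   # [0, 2, 4]
```

Because Q never occurs in the text, the search for patternQ never succeeds and is an ordinary single-pattern KMP run, so the linear bound applies to it unchanged.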

You can compute the Pi function (prefix function) of a string in O(length). This formulation of KMP builds a special string of length n+m+1 (presumably the pattern, a separator character, and the text) and computes the Pi function on it, so in any case the complexity will be O(n+m+1)=O(n+m).
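A minimal sketch of that formulation, assuming the special string is pattern + separator + text, with a separator that occurs in neither (I use "\0"; the function names are mine):

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def find_all_via_pi(pattern, text, sep="\0"):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    pi = prefix_function(pattern + sep + text)   # string of length n + m + 1
    # pi[i] == m means an occurrence ends at combined index i,
    # i.e. it starts at text index i - (m + 1) - (m - 1) = i - 2*m.
    return [i - 2 * m for i, v in enumerate(pi) if v == m]


print(find_all_via_pi("aaaaaa", "aaaaaaaaa"))   # [0, 1, 2, 3]
```

The separator keeps every Pi value at or below m, so a value equal to m marks a complete occurrence, and all occurrences fall out of the single O(n+m) pass.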
