
Better understanding and comparison of Boyer-Moore and KMP algorithm

I have recently been studying string-searching algorithms such as Knuth-Morris-Pratt (KMP) and Boyer-Moore. Along the way I ran into some details about both that I either could not digest, or have formed my own understanding of without being sure it is correct.

Questions:

  1. The top answer to this question states that KMP works well if the alphabet is small. Why exactly is that the case, and why can't Boyer-Moore perform better than KMP in such a case?
  2. What is an example for each where KMP and Boyer-Moore give their worst performance? I figured that for an example like the following, Boyer-Moore would give its worst performance. Is that right?

text = 'AAAAA...' (13 A's in total)

pattern = 'AAA'

  3. I understand the proper-prefix aspect of KMP, and I accept that it doesn't skip possible matches while skipping the already matched portion of the text. I also get the intuition behind the bad character heuristic and the good suffix heuristic of Boyer-Moore, which skip characters so that the pattern lines up with a possible future match. What I still can't convince myself of is how the two heuristics guarantee that the skipped positions could not have produced a match anyway.

The 4th paragraph on the 2nd page of the given document makes the same point: we can skip certain characters of the text without looking at them. Why can we ignore them?

  4. In layman's terms, can we claim that the difference between KMP and Boyer-Moore is that KMP works by skipping already matched characters, while Boyer-Moore works by skipping characters that cannot make any difference because the current window position on the text already has a mismatch?

First, you should differentiate between the original Boyer-Moore and the variant with the Galil rule, because they have different worst-case complexity. Let's look at the original Boyer-Moore algorithm in the different cases:

Worst-case performance: Θ(m) preprocessing + O(mn) matching.

Best-case performance: Θ(m) preprocessing + Ω(n/m) matching.

As you can see, the worst case for matching in the original is O(mn), which is not even linear and is much worse than KMP's O(m + n). On the other hand, it can run in sub-linear time in the best case. That best case depends on the bad character rule, like so:

Let's assume you have a pattern of length m whose last character doesn't occur in T at all (or almost never). Every alignment then fails on its very first comparison, because Boyer-Moore compares from the right end of the pattern. If the text character at that position doesn't occur in the pattern either, the bad character rule shifts the pattern by its full length m, so you don't even need to look at every character of T. This is why Boyer-Moore is the better choice for bigger alphabets: you have a higher chance of hitting such characters and making those jumps.
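To make the sub-linear case concrete, here is a minimal sketch in Python that uses only the bad character rule and counts character comparisons. The function name `bad_char_search` and the comparison counter are my own additions for illustration, not part of any library, and the good suffix rule is left out on purpose.

```python
def bad_char_search(text, pattern):
    """Return (match positions, number of character comparisons).

    Simplified Boyer-Moore: bad character rule only (illustrative sketch).
    """
    n, m = len(text), len(pattern)
    # Rightmost index of every character that occurs in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    matches, comparisons = [], 0
    i = 0  # current alignment: pattern[0] sits under text[i]
    while i <= n - m:
        j = m - 1  # compare from the right end of the pattern
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[i + j]:
                break
            j -= 1
        if j < 0:
            matches.append(i)
            i += 1
        else:
            # Line up the rightmost occurrence of the mismatched text
            # character with text[i + j]; jump past it if it is absent.
            i += max(1, j - last.get(text[i + j], -1))
    return matches, comparisons


text = "ABCD" * 250   # n = 1000, 'D' never occurs in the pattern
pattern = "ABCF"      # m = 4, 'F' never occurs in the text
matches, comparisons = bad_char_search(text, pattern)
print(matches, comparisons)  # [] 250
```

With n = 1000 and m = 4 the search makes 250 comparisons, that is n/m, because every alignment fails on its first comparison and jumps by the whole pattern length.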

If you insist on examples:

Example where KMP is better than Boyer-Moore:

T: AAAAAA.....

P: AAA

Example where Boyer-Moore is better than KMP:

T: ABCDABCDABCD.....

P: ABCF
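For the KMP side of these two examples, here is a minimal sketch in Python that counts character comparisons. The names `failure_function` and `kmp_search` and the counter are assumptions made for illustration. On the all-A text it makes exactly one comparison per text character, whereas the original Boyer-Moore (without the Galil rule) re-checks all m pattern characters at each of the n - m + 1 alignments. On the ABCD text, KMP still has to examine every text character at least once, which is where Boyer-Moore's jumps win.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return (match positions, number of character comparisons)."""
    fail = failure_function(pattern)
    matches, comparisons = [], 0
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1
                break
            if k == 0:
                break
            k = fail[k - 1]  # fall back without moving backwards in the text
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep the overlap for the next match
    return matches, comparisons


# KMP-friendly example: 13 comparisons, one per text character.
print(kmp_search("A" * 13, "AAA"))

# Boyer-Moore-friendly example: KMP cannot skip any text character here.
print(kmp_search("ABCD" * 250, "ABCF"))
```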

About your third question: each rule in Boyer-Moore on its own still finds all occurrences in T, because all a rule does is eliminate alignments at which the pattern cannot possibly match:

The bad character rule eliminates every alignment that would put a different pattern character under the mismatched text character. So it jumps to the nearest alignment where the pattern does have that character at that position, or passes over it entirely if the pattern doesn't contain it.

The good suffix rule eliminates every alignment in which the suffix you have already matched would not line up with the pattern after sliding it. In effect you slide the pattern to the next occurrence of that suffix inside the pattern, if one exists. The idea is really similar to KMP, but with a suffix instead of a prefix.

You can think of it as the naive solution that checks all n*m combinations, with both rules acting as elimination steps: you take the maximum of the two shifts, and every alignment you skip has been ruled out by the rule that produced the larger shift.
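The elimination argument can be checked with a small sketch in Python. To keep the reasoning visible, the good suffix shift here is computed by brute force straight from its definition (the smallest shift that stays consistent with the suffix already matched), not with the usual Θ(m) preprocessing. All function names are assumptions made for illustration. The exhaustive loop at the end compares the result against the naive search on every short string over a two-letter alphabet, so a skipped match would trip the assertion.

```python
from itertools import product


def good_suffix_shift(pattern, j):
    """Smallest shift s >= 1 consistent with pattern[j + 1:] having matched.

    Brute-force (weak) good suffix rule; j == -1 means the whole pattern matched.
    """
    m = len(pattern)
    for s in range(1, m + 1):
        # After shifting by s, pattern[k - s] lands under the text character
        # that equalled pattern[k]; any disagreement rules this shift out.
        if all(k - s < 0 or pattern[k - s] == pattern[k]
               for k in range(j + 1, m)):
            return s  # s == m always passes, so the loop always returns


def boyer_moore(text, pattern):
    """Return all match positions, shifting by the max of both rules."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = {ch: i for i, ch in enumerate(pattern)}  # rightmost occurrences
    gs = [good_suffix_shift(pattern, j) for j in range(-1, m)]  # gs[j + 1]
    matches, i = [], 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            matches.append(i)
            i += gs[0]
        else:
            bad_char = j - last.get(text[i + j], -1)
            i += max(1, bad_char, gs[j + 1])
    return matches


def naive(text, pattern):
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


# Exhaustive cross-check on a tiny alphabet: no match is ever skipped.
for n in range(1, 9):
    for t in product("AB", repeat=n):
        for m in range(1, 5):
            for p in product("AB", repeat=m):
                text, pattern = "".join(t), "".join(p)
                assert boyer_moore(text, pattern) == naive(text, pattern)
```

A small alphabet is the harder case for the skipping rules, since repeated characters keep the shifts short, which is why the check uses only A and B.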

As for your last question, I think yes, that is a pretty good synopsis of the idea behind KMP and Boyer-Moore.

Also remember that with the Galil rule, Boyer-Moore's worst case becomes linear, so you may get a result as good as KMP's in the worst case and better than KMP's on average in time complexity, but not in space complexity (this depends on the implementation too).
