简体   繁体   中英

What is the fastest way to find all occurrences of a substring?

This is purely out of curiosity. I was browsing through an article comparing various string search algorithms and noticed they were all designed to find the first matching substring. This got me thinking... What if I wanted to find all occurrences of a substring?

I'm sure I could create a loop that used a variant of KMP or BM and dumped each found occurrence into an array but this hardly seems like it would be the fastest.

Wouldn't a divide and conquer algorithm be superior?

For instance lets say your looking for the sequence "abc" in a string "abbcacabbcabcacbccbabc".

  1. On the first pass find all occurrences of the first character and store their positions.
  2. On each additional pass use the positions from the preceding pass to find all occurrences of next character, reducing the candidates for the next pass with each iteration.

Considering the ease with which I came up with this idea I assume someone already came up with it and improved upon it 30 years ago.

See Suffix array

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(mlogn) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + logn) search time.

If you are only processing a given string once, the suffix array is overkill. It takes O(n log n) time to create, so a KMP style algorithm will beat it. Furthermore, if your string is enormous, or you want to get results in real-time as you receive the string, the suffix array won't work.

It is certainly possible to modify the KMP algorithm to keep going after it finds a match without taking additional memory, aside from the memory used to store the matches (which may be unnecessary as well, if you are simply printing out the matches or processing them as you go along). As a start, take the Wikipedia implementation and modify the "return m" statement to "add m to a list of indexes". But you're not done yet. You also need to ask yourself, do you allow overlapping occurrences? For example, if your substring is "abab" and you are looking in the main string "abababab", are there two occurrences or three? In the example I gave ("as a start"), you could either reset i to 0 to give the "two" answer, or you could fall through to the "otherwise" case after the "add m" to give the "three" answer.

There is no single "fastest way" it depends on

A) What the string actually is build of (length, character distribution, ...)

B) On which hardware this runs

C) If you want all results in parallel or sequential

D) Other parameters (eg can found elements overlap, are you searching once or multiple times)

E) If you see this implementation specific or just academic. In implementation there are lots of additional ways to optimize stuff. Eg temporary storage (like in your suggestion) is often very expensive.

The Idea you have eg would totally trash any CPU cache for long strings. So it would be VERY slow in those cases.

Both KMP and BM can easily be used for finding multiple matches as well. I would also recommend using Rabin-Karp , which IMHO is easier to understand but not really as fast for multiple matches (O(n+k*m) I think, where n is the length of the text, m is the length of the pattern and k is the number of occurrences). But it is easy to modify for both overlapping and non-overlapping matches.

It can also be done using suffix trees/suffix arrays, but they are harder to code and don't really buy you any increase in speed.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM