What is the fastest way to find all occurrences of a substring?

Original post 2009-10-12 20:20:10 5 4 algorithm/search

This is purely out of curiosity. I was browsing through an article comparing various string search algorithms and noticed they were all designed to find the first matching substring. This got me thinking... What if I wanted to find all occurrences of a substring?

I'm sure I could create a loop that used a variant of KMP or BM and dumped each found occurrence into an array, but this hardly seems like it would be the fastest.

Wouldn't a divide and conquer algorithm be superior?

For instance, let's say you're looking for the sequence "abc" in a string "abbcacabbcabcacbccbabc".

On the first pass find all occurrences of the first character and store their positions.
On each additional pass use the positions from the preceding pass to find all occurrences of next character, reducing the candidates for the next pass with each iteration.
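
The two steps above can be sketched roughly as follows. This is only my reading of the idea, written in Python; the function name `find_all_by_filtering` is made up for illustration. Note that it stores a candidate list on every pass and can do O(n*m) character comparisons in the worst case.

```python
def find_all_by_filtering(text, pattern):
    """Return start positions of pattern in text by filtering candidates
    one pattern character at a time (hypothetical helper, not a library call)."""
    if not pattern:
        return []
    # A match cannot start later than this index.
    last_start = len(text) - len(pattern)
    # First pass: every position where the first character matches.
    candidates = [i for i in range(last_start + 1) if text[i] == pattern[0]]
    # Each additional pass keeps only candidates whose next character matches.
    for offset in range(1, len(pattern)):
        candidates = [i for i in candidates
                      if text[i + offset] == pattern[offset]]
    return candidates


print(find_all_by_filtering("abbcacabbcabcacbccbabc", "abc"))
```

For the example string, the candidates shrink from six positions (every "a" that leaves room for a match) to four, and then to the two real matches.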

Considering the ease with which I came up with this idea, I assume someone already came up with it and improved upon it 30 years ago.

4 answers

See Suffix array

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(m log n) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + log n) search time.
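
A minimal Python sketch of the straightforward binary search described in the quote. The function names are hypothetical, and the suffix array is built naively by sorting whole suffixes, which is much slower than a real construction algorithm and is only meant to keep the example short. No LCP information is used, so this is the O(m log n) variant.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix itself.
    Fine for illustration, far too slow for large inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all_with_suffix_array(text, suffix_array, pattern):
    """Return all start positions of pattern, in increasing order."""
    m = len(pattern)
    if m == 0:
        return []

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Matching suffixes are contiguous in the array but not sorted by position.
    return sorted(suffix_array[first:lo])


text = "abbcacabbcabcacbccbabc"
sa = build_suffix_array(text)
print(find_all_with_suffix_array(text, sa, "abc"))
```

The index only pays off when the same text is searched many times, since building it costs more than a single linear scan.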

If you are only processing a given string once, the suffix array is overkill. It typically takes O(n log n) time to build, so a KMP-style algorithm, which runs in O(n + m), will beat it. Furthermore, if your string is enormous, or you want to get results in real time as you receive the string, the suffix array won't work.

It is certainly possible to modify the KMP algorithm to keep going after it finds a match without taking additional memory, aside from the memory used to store the matches (which may be unnecessary as well, if you are simply printing out the matches or processing them as you go along). As a start, take the Wikipedia implementation and change the "return m" statement to "add m to a list of indexes". But you're not done yet. You also need to ask yourself: do you allow overlapping occurrences? For example, if your substring is "abab" and you are looking in the main string "abababab", are there two occurrences or three? In the example I gave ("as a start"), you could either reset i to 0 (and move m just past the match) to give the "two" answer, or you could fall through to the "otherwise" case after the "add m" to give the "three" answer.
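
Here is a rough Python sketch of that modification. It is not the Wikipedia pseudocode itself: it uses the common failure-function formulation of KMP, and the function names and the `overlapping` flag are my own. The only difference from a first-match search is what happens after a full match is found.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern, overlapping=True):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            # Overlapping: continue from the longest border of the pattern.
            # Non-overlapping: start from scratch after the match.
            k = fail[k - 1] if overlapping else 0
    return matches


print(kmp_find_all("abababab", "abab", overlapping=True))   # [0, 2, 4]
print(kmp_find_all("abababab", "abab", overlapping=False))  # [0, 4]
```

Because the text is consumed one character at a time and never revisited, the same loop also works when the text arrives as a stream, and the matches could be yielded instead of collected.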

There is no single "fastest way"; it depends on:

A) What the string is actually made of (length, character distribution, ...)

B) Which hardware this runs on

C) Whether you want all results in parallel or sequentially

D) Other parameters (e.g. can found elements overlap, are you searching once or multiple times)

E) Whether you see this as implementation-specific or purely academic. In an implementation there are lots of additional ways to optimize things. E.g. temporary storage (like in your suggestion) is often very expensive.

The idea you have, for example, would totally thrash the CPU cache for long strings, so it would be VERY slow in those cases.

Both KMP and BM can easily be used for finding multiple matches as well. I would also recommend looking at Rabin-Karp, which IMHO is easier to understand but not really as fast for multiple matches (O(n + k*m), I think, where n is the length of the text, m is the length of the pattern and k is the number of occurrences). But it is easy to modify for both overlapping and non-overlapping matches.
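
To illustrate, a small Rabin-Karp sketch in Python that collects every match. The function name, the `overlapping` flag, and the `base` and `mod` values are arbitrary choices of mine, not anything prescribed by the algorithm. Each hash hit is verified by a direct comparison, which is where the k*m term comes from (plus any spurious hash collisions).

```python
def rabin_karp_find_all(text, pattern, overlapping=True,
                        base=256, mod=1_000_000_007):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(base, m - 1, mod)  # weight of the window's leading character
    pattern_hash = 0
    window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod

    matches = []
    next_allowed = 0  # earliest start permitted when overlaps are disallowed
    for start in range(n - m + 1):
        if (window_hash == pattern_hash
                and start >= next_allowed
                and text[start:start + m] == pattern):  # verify the hash hit
            matches.append(start)
            if not overlapping:
                next_allowed = start + m
        if start + m < n:
            # Roll the window: drop text[start], append text[start + m].
            window_hash = ((window_hash - ord(text[start]) * high) * base
                           + ord(text[start + m])) % mod
    return matches


print(rabin_karp_find_all("abababab", "abab"))                     # [0, 2, 4]
print(rabin_karp_find_all("abababab", "abab", overlapping=False))  # [0, 4]
```

Switching between overlapping and non-overlapping matches only changes which start positions are accepted; the rolling hash itself is untouched.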

It can also be done using suffix trees/suffix arrays, but they are harder to code and don't really buy you any increase in speed.
