Performance of Occurences of Substring in String

Question

I came across the task to find all occurences of a substring in another string and was wondering what will be the best algorithm to solve this problem.

For demonstration purposes I used the string "The cat sat on the mat" and search for all occurences of the substring "at". This should ultimately result in an occurence count of 3. Since I'm programming in java at the moment the first thing that popped into my mind was this:

    public static void main(String[] args) {

      int count=0;
      String s = "The cat sat on the mat";

      Pattern pattern = Pattern.compile("at");
      Matcher matcher = pattern.matcher(s);
      while(matcher.find()){
          count++;
      }

      System.out.println("Pattern: "+pattern+" Count: "+count);
    }

Somehow I doubt that this is the optimal solution for this problem. So if someone knows how an optimal (or at least pretty good) solution should look ... please answer! You can post your answer in any language not necessarily java (altough that would be great :)).

Thanks a lot!

Answer 1

There are quite some impressive substring algorithms. Often the Boyer-Moore algorithm ( http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm ) is mentioned, but there are other alternatives, like http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and http://en.wikipedia.org/wiki/Rabin-karp .

Answer 2

Without the overhead of regular expressions:

public static void main(String[] args) {

    int count = 0;
    String s = "The cat sat on the mat";
    String substring = "at";

    int pos = s.indexOf(substring);
    while (pos > -1) {
        count++;
        pos = s.indexOf(substring, pos + 1);
    }

    System.out.println("Pattern: "+pattern+" Count: "+count);
}

I did a quick test searching for "at" in the text of the Boyer–Moore string search algorithm article on Wikipedia. They both find the same number of matches, but doing this 10.000 times on my machine took the regex algorithm 1702 milliseconds and this just 192!

Answer 3

As usual, it depends.

The theoretic best approach is to probably to use suffix trees -- but they only start making sense on very large strings. Suffix arrays are slightly harder to use, but make sense for smaller strings. IIRC, the zlib deflate algorithm uses suffix arrays to find repeated substrings. In either case, the algorithms are not straightforward, and need quite a bit of study to understand and to implement efficiently.

If you're just worried about programmer productivity and easily understood code, I guess it's hard to beat what you've written. Assuming a reasonably intelligent regexp parser, it might be fast enough for normal use.

Performance of Occurences of Substring in String

Question

3 answers

solution1
2 ACCPTED 2010-08-27 10:02:42

solution2
1 2010-08-28 15:37:33

solution3
0 2010-08-27 09:57:28

Performance of Occurences of Substring in String

Question

3 answers

solution1 2 ACCPTED 2010-08-27 10:02:42

solution2 1 2010-08-28 15:37:33

solution3 0 2010-08-27 09:57:28

solution1
2 ACCPTED 2010-08-27 10:02:42

solution2
1 2010-08-28 15:37:33

solution3
0 2010-08-27 09:57:28