
Time complexity analysis of recursive algorithm

I would like to know the complexity of the following algorithm and, most importantly, the step-by-step process that leads to deducing it.

I suspect it's O(length(text)^2*length(pattern)) but I have trouble solving the recurrence equation.

How would the complexity improve when doing memoization (i.e. dynamic programming) on the recursive calls?

Also, I would appreciate pointers to techniques/books that would help me learn how to analyze this kind of algorithm.

In Python:

def count_matches(text, pattern):
  if len(pattern) == 0: return 1

  result = 0
  for i in range(len(text)):
    if text[i] == pattern[0]:
      # repeat the operation with the remaining text and pattern
      result += count_matches(text[i+1:], pattern[1:])

  return result

In C:

int count_matches(const char text[],    int text_size, 
                  const char pattern[], int pattern_size) {

  if (pattern_size == 0) return 1;

  int result = 0;

  for (int i = 0; i < text_size; i++) {
    if (text[i] == pattern[0])
      /* repeat the operation with the remaining text and pattern */
      result += count_matches(text+i+1, text_size-(i+1),
                              pattern+1, pattern_size-1);
  }

  return result;  
}

Note: The algorithm intentionally repeats the matching for every substring. Please don't focus on what kind of matching the algorithm is performing, just on its complexity.

Apologies for the (now fixed) typos in the algorithms

My intuition that the complexity is O(length(text)^3) is incorrect. It is actually O(n!), purely because the implementation is of the form

def do_something(relevant_length):
    # base case

    for i in range(relevant_length):
        # some constant time work

        do_something(relevant_length - 1)

as discussed in Example of O(n!)?
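In recurrence form, with a constant amount of work per loop iteration, that skeleton gives

T(n) = n * T(n-1) + Θ(n), with T(0) = Θ(1)

which unrolls to T(n) = Θ(n!): dividing through by n! gives T(n)/n! = T(n-1)/(n-1)! + Θ(1/(n-1)!), and the added terms sum to a constant.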

If memoization is used, each subproblem in the recursion tree is computed once and then looked up on every subsequent call.

Picture the shape of the recursion tree.

We make progress one character per layer. There are 2 base cases. The recursion bottoms out when we reach the end of pattern OR when there are no characters left in text to iterate over. The first base case is explicit, but the second just falls out of the implementation.

So the depth (height) of the recursion tree is min[length(text), length(pattern)].

How many subproblems? We also make progress one character per layer. If all characters in text were compared, then using the Gauss sum S = n(n+1)/2, the total number of subproblems that will ever be evaluated, across all recursion layers, is length(text) * (length(text) + 1) / 2.

Take length(text) = 6 and length(pattern) = 10, where length(text) < length(pattern). The depth is min[length(text), length(pattern)] = 6.

PTTTTT
PTTTT
PTTT
PTT
PT
P

What if length(text) = 10 and length(pattern) = 6, where length(text) > length(pattern)? The depth is min[length(text), length(pattern)] = 6.

PTTTTTTTTT
PTTTTTTTT
PTTTTTTT
PTTTTTT
PTTTTT
PTTTT

What we see is that length(pattern) doesn't really contribute to the complexity analysis. In cases where length(pattern) < length(text), we're just hacking off a bit of the Gauss sum.

But, because text and pattern step forward together one for one, we end up doing much less work. The recursion tree looks like the diagonal of a square matrix.

For length(text) = 6 and length(pattern) = 10 as well as for length(text) = 10 and length(pattern) = 6, the tree is

P
 P
  P
   P
    P
     P

Hence, the complexity of the memoized approach is

O( min( length(text), length(pattern) ) )

Edit: Given @fons' comment, what if recursion is never triggered? Specifically, in the case where text[i] == pattern[0] is never true for any i. Then iterating through all of text is the dominating factor, even if length(text) > length(pattern).

So that implies the actual upper bound of the memoized approach is

O( max( length(text), length(pattern) ) )

Thinking about it a bit more, in the case when length(text) > length(pattern) and recursion IS triggered, even when pattern is exhausted, it takes constant time to recurse and check that pattern is now empty, so length(text) still dominates.

This makes the upper bound of the memoized version O(length(text)).

Ehm... I could be wrong, but as far as I can see, your runtime is driven by this loop (this answer analyzes the variant that recurses on text[1:]):

for c in text:
    if (c == pattern[0]):
      # repeat the operation with the remaining string a pattern
      result += count_matches(text[1:], pattern[1:])

Basically, let the length of your text be n; we don't need the length of the pattern.

The first time this loop runs (in the top-level call), it makes n recursive calls. Each of those n calls will, in the worst case, spawn n-1 further instances of your program. Those n-1 instances will, in the worst case, spawn n-2 instances, and so on.

This results in n*(n-1)*(n-2)*...*1 calls, which is n!. So your worst-case runtime is O(n!). Pretty bad (:

I ran your Python program several times with input that causes the worst-case runtime:

In [21]: count_matches("aaaaaaa", "aaaaaaa")

Out[21]: 5040

In [22]: count_matches("aaaaaaaa", "aaaaaaaa")

Out[22]: 40320

In [23]: count_matches("aaaaaaaaa", "aaaaaaaaa")

Out[23]: 362880

The last input is 9 symbols and 9! = 362880.

To analyze the runtime of your algorithm, you first need to think of the input that causes the worst possible runtime. In your algorithm, best and worst cases vary quite a bit, so you would probably need average-case analysis, but that is quite complicated. (You would need to define what an average input is and how often the worst case would be seen.)

Dynamic programming can help alleviate your runtime quite a bit, but analysis is harder. Let's first code a simple unoptimized dynamic programming version:

cache = {}
def count_matches_dyn(text, pattern):
  if len(pattern) == 0: return 1

  result = 0
  for c in text:
    if c == pattern[0]:
      # repeat the operation with the remaining text and pattern,
      # caching the result for each (text, pattern) suffix pair
      key = (text[1:], pattern[1:])
      if key not in cache:
        cache[key] = count_matches_dyn(text[1:], pattern[1:])
      result += cache[key]

  return result

Here we cache all calls to count_matches_dyn in a dictionary, so when we call it with the same input we get the cached result instead of calling the function again. (This is known as memoization.)

Now let's analyze it. The main loop

  for c in text:
    if c == pattern[0]:
      # repeat the operation with the remaining text and pattern,
      # caching the result for each (text, pattern) suffix pair
      key = (text[1:], pattern[1:])
      if key not in cache:
        cache[key] = count_matches_dyn(text[1:], pattern[1:])
      result += cache[key]

will run n times on the first call (our cache is empty). However, the first recursive call will populate the cache:

cache[key] = count_matches_dyn(text[1:], pattern[1:])

And every other iteration of the same loop will cost O(1). So the top-level recursion will cost O(n-1) + (n-1)*O(1) = O(n-1) + O(n-1) = 2*O(n-1). You can see that, of the calls further down the recursion, only the first one descends into many recursive calls (the O(n-1) call); the rest cost O(1) because they are just dictionary lookups. Given all that was said, the runtime is 2*O(n-1), which amortizes to O(n).

Disclaimer. I am not entirely sure about the analysis of the dynamic programming version, please feel free to correct me (:

Disclaimer 2. The dynamic programming code contains expensive slicing operations (text[1:], pattern[1:]) which are not factored into the analysis. This is on purpose, because in any reasonable implementation you can drastically reduce the cost of those operations. The point is to show how simple caching can drastically reduce runtime.
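For illustration, a minimal sketch of that idea: memoize on integer offsets (i, j) instead of string slices, here applied to the question's fixed algorithm using functools.lru_cache (the inner helper go is just for exposition):

from functools import lru_cache

def count_matches_memo(text, pattern):
    @lru_cache(maxsize=None)
    def go(i, j):
        # j == len(pattern): the whole pattern has been matched once
        if j == len(pattern):
            return 1
        result = 0
        # scan the rest of text for the next pattern character
        for k in range(i, len(text)):
            if text[k] == pattern[j]:
                result += go(k + 1, j + 1)
        return result

    return go(0, 0)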

  • First, let us rise above the code and formulate the problem this code is trying to solve.

The Python version seems to count the number of occurrences of pattern as a subsequence of text. The C version as originally posted looked broken, so I'll assume below that the Python version is right.

  • Then, look back at the code and note some general things about how the solution is carried out.

The function calculates the answer by adding up 0s and 1s. Thus the number of operations is at least the number of 1s one needs to add up to get the answer, that is, the answer itself.

  • Now, let us devise an input (text, pattern) which will give the worst possible runtime for given lengths of text and pattern .

The largest answer is clearly some case where all letters are equal.

  • After that, we use the above simplification of input and some knowledge of mathematics to calculate the answer directly.

When all letters are equal, the answer is essentially the number of ways to choose k = len(pattern) items (letters) out of n = len(text), which is choose(n, k); see the check after this list.

  • Next, we pick lengths of text and pattern which give us the worst possible complexity.

By example: for text = 'a' * 100 and pattern = 'a' * 50, the answer is choose(100, 50) = 100! / (50! * 50!). Generally, for a fixed length of text, the length of pattern must be half of that, rounded either way if necessary. It's an intuitive notion one gets when looking at Pascal's triangle. Formally, it is trivial to prove by comparing choose(n, k) and choose(n, k±1) by hand.

  • Estimate the answer we got.

The sum choose(n, 0) + choose(n, 1) + ... + choose(n, n) is 2^n, and intuitively again, choose(n, n/2) is a considerable fraction of that. More formally, by Stirling's formula, it turns out that choose(n, n/2) is on the order of 2^n divided by sqrt(n).

  • Finally, note that more detailed analysis is probably unnecessary.

When the complexity is exponential, we are usually less interested in precise polynomial factors. Say, 2^100 operations (O(2^n)) and 100 times 2^100 operations (O(n * 2^n)) are equally impossible to complete in reasonable time. What would matter is to reduce O(2^n) to O(2^(n/2)), or better, to find a polynomial solution.

  • Recall that what we found is a lower bound.

Actually, the complexity would indeed be choose(len(text), len(pattern)) multiplied by some polynomial if we add the following line at the top:

if len(text) < len(pattern): return 0

Indeed, there can be no match if the number of letters left in the text is less than the length of pattern. Without this line, we can have a larger number of recursion branches which ultimately result in adding 0 to the answer.

  • Here is a view from another angle.

Looking from another side, we can prove that the number of operations in the unaltered code can be as high as 2 to the power of len(text).

Indeed, when text = 'a' * n and pattern = 'a' * n, suppose we have already processed k letters of text. Each of these letters, independently of the others, could have been either matched with some letter of pattern or left out in the loop. So we have two ways to go for each letter of text, and thus 2^n ways to go once we have processed all n letters of text, that is, arrived at a terminating call of our recursive function.
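To make the choose(n, k) claim above concrete, here is a small check, restating the question's fixed Python version for self-containment (math.comb requires Python 3.8+):

from math import comb

def count_matches(text, pattern):
    if len(pattern) == 0:
        return 1
    result = 0
    for i in range(len(text)):
        if text[i] == pattern[0]:
            result += count_matches(text[i+1:], pattern[1:])
    return result

# With all letters equal, the count equals the number of ways to
# choose len(pattern) positions out of len(text):
for n, k in [(6, 3), (8, 4), (10, 5)]:
    assert count_matches("a" * n, "a" * k) == comb(n, k)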

The time complexity should improve to something of the order of O(length(text) * length(pattern)) from the recursive one (O(n!)).

The memoized solution (DP) involves building a text-vs-pattern lookup table, which can be filled in incrementally starting from the ends of the text and pattern, as in the sketch below.
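A minimal bottom-up sketch of such a table (an illustration of the idea, not code from the answer; dp[i][j] counts the occurrences of pattern[j:] as a subsequence of text[i:]):

def count_matches_dp(text, pattern):
    n, m = len(text), len(pattern)
    # dp[i][j] = number of times pattern[j:] occurs as a subsequence of text[i:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][m] = 1  # the empty pattern matches exactly once
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dp[i][j] = dp[i + 1][j]            # skip text[i]
            if text[i] == pattern[j]:
                dp[i][j] += dp[i + 1][j + 1]   # match text[i] to pattern[j]
    return dp[0][0]

Filling the table takes Θ(length(text) * length(pattern)) time, matching the estimate above.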

I'm afraid your algorithm is incorrect for pattern matching, mainly because it will search for a sub-substring in the rest of the text once it finds that a first character matches. For example, for the text "abbccc" and the pattern "accc", your algorithm will return a result equal to 1.

You should consider implementing the "naive" algorithm for pattern matching, which is very similar to what you were trying to do, but without recursion. Its complexity is O(n*m), where n is the text length and m is the pattern length. In Python you could use the following implementation:

text = "aaaaabbbbcccccaaabbbcccc"
pattern = "aabb"
result = 0

index = text.find(pattern)
while index > -1:
    result += 1
    print(index)  # position of each match found
    index = text.find(pattern, index + 1)

print(result)

Regarding books on the subject, my best recommendation is Cormen's "Introduction to Algorithms" , which covers all the material on algorithms and complexity.
