Count the number of times an item occurs in a sequence using recursion Python

I'm trying to count the number of times an item occurs in a sequence, whether it's a list of numbers or a string. It works fine for numbers, but I get an error when trying to find a letter like "i" in a string:

def Count(f,s):
    if s == []: 
        return 0
    while len(s) != 0:
        if f == s[0]:
            return 1 + Count(f,s[1:])
        else:
            return 0 + Count(f,s[1:])

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

There's a far more idiomatic way to do it than using recursion: use the built-in count method to count occurrences.

def count(seq, item):
    # Works for both str and list; the name "seq" avoids shadowing the built-in str.
    return seq.count(item)

>>> count("122333444455555", "4")
4

However, if you want to do it with iteration, you can apply a similar principle: loop over the items and count the matches. Strings and lists are both directly iterable, so no conversion to a list is needed.

def count(seq, item):
    total = 0
    for element in seq:
        if element == item:
            total += 1
    return total

The problem is your first if, which explicitly checks if the input is an empty list:

if s == []: 
    return 0
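To see where the NoneType in the error message comes from, it can help to run the original function on an empty string. The snippet below is only a small reproduction (the function body is copied from the question, with a comment added): an empty string is never equal to [], the while loop is skipped, and the function falls off the end and implicitly returns None.

```python
def Count(f, s):
    if s == []:
        return 0
    while len(s) != 0:
        if f == s[0]:
            return 1 + Count(f, s[1:])
        else:
            return 0 + Count(f, s[1:])
    # No return statement is reached for an empty string,
    # so Python implicitly returns None here.

print("" == [])        # False: an empty str never equals an empty list
print(Count("i", ""))  # None

try:
    Count("i", "hi")   # the innermost call receives "" and returns None
except TypeError as exc:
    print(exc)
```

Lists do not show the problem because slicing a list eventually produces [], which the first check does catch.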

If you want it to work with strs and lists you should simply use:

if not s:
    return 0

In short, any empty sequence is considered false and any non-empty sequence is considered true under Python's truth value testing. The "Truth Value Testing" section of the Python documentation has the details.
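As a quick illustration of that rule, the following sketch checks a handful of built-in sequence types. The selection of types is my own; the point is only that "not s" treats every empty built-in sequence the same way, which "s == []" does not.

```python
empty_sequences = ["", [], (), b"", range(0)]
non_empty_sequences = ["a", [0], (None,), b"\x00", range(1)]

for s in empty_sequences:
    # Every empty built-in sequence is falsy ...
    assert not s
    # ... but only the empty list compares equal to [].
    print(repr(s), "== [] ->", s == [])

for s in non_empty_sequences:
    # Non-empty sequences are truthy, even if their items are falsy.
    assert s
```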

You can also omit the while loop: it is unnecessary because the function always returns during the first iteration and therefore never actually loops.

So the result would be something along these lines:

def count(f, s):
    if not s: 
        return 0
    elif f == s[0]:
        return 1 + count(f, s[1:])
    else:
        return 0 + count(f, s[1:])

Example:

>>> count('i', 'what is it')
2

In case you're not only interested in making it work but also interested in making it better there are several possibilities.

Booleans subclass from integers

In Python booleans are just integers, so they behave like integers when you do arithmetic:

>>> True + 0
1
>>> True + 1
2
>>> False + 0
0
>>> False + 1
1

So you can easily inline the if/else:

def count(f, s):
    if not s: 
        return 0
    return (f == s[0]) + count(f, s[1:])

This works because f == s[0] returns True (which behaves like 1) if they are equal or False (which behaves like 0) if they aren't. The parentheses are not necessary, but I added them for clarity. And because the base case always returns an integer, the function itself will always return an integer.
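A short check of that last claim, reusing the function from above: adding a bool to an int produces a plain int, so even a single-element match comes back as 1 rather than True.

```python
def count(f, s):
    if not s:
        return 0
    return (f == s[0]) + count(f, s[1:])

result = count('i', 'i')
print(result, type(result))   # 1 <class 'int'>
print(issubclass(bool, int))  # True
```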

Avoiding copies in the recursive approach

Your approach will create a lot of copies of the input because of the:

s[1:]

This creates a shallow copy of the whole list (or string, ...) except for the first element. That means you actually have an operation that uses O(n) (where n is the number of elements) time and memory in every function call and because you do this recursively the time and memory complexity will be O(n**2) .
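To make the quadratic cost concrete, here is a rough sketch that only tallies how many element references the repeated s[1:] slices copy. The helper name copied_elements is made up for this illustration; it does no counting of needles at all.

```python
def copied_elements(n):
    """Total number of references copied when the head of a length-n
    list is sliced off repeatedly until the list is empty."""
    s = list(range(n))
    copied = 0
    while s:
        s = s[1:]         # the same slice the recursive function takes
        copied += len(s)  # each slice copies len(s) references
    return copied

for n in (10, 100, 1000):
    # The total matches n * (n - 1) / 2, which grows like n**2.
    print(n, copied_elements(n), n * (n - 1) // 2)
```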

You can avoid these copies, for example, by passing the index in:

def _count_internal(needle, haystack, current_index):
    length = len(haystack)
    if current_index >= length:
        return 0
    found = haystack[current_index] == needle
    return found + _count_internal(needle, haystack, current_index + 1)

def count(needle, haystack):
    return _count_internal(needle, haystack, 0)

Because I needed to pass in the current index, I added another function that takes the index (I assume you don't want the index to appear in your public function's signature). If you prefer, you could make it an optional argument instead:

def count(needle, haystack, current_index=0):
    length = len(haystack)
    if current_index >= length:
        return 0

    return (haystack[current_index] == needle) + count(needle, haystack, current_index + 1)

However, there is probably an even better way: convert the sequence to an iterator and use that internally. At the start of the function you take the next element from the iterator. If there is no element left you end the recursion; otherwise you compare the element and then recurse into the remaining iterator:

def count(needle, haystack):
    # Convert it to an iterator; if it already
    # is a (well-behaved) iterator this is a no-op.
    haystack = iter(haystack)

    # Try to get the next item from the iterator
    try:
        item = next(haystack)
    except StopIteration:
        # No element remained
        return 0

    return (item == needle) + count(needle, haystack)

Of course you could also use an internal method if you want to avoid the iter call overhead that is only necessary the first time the function is called. However that's a micro-optimization that may not result in noticeably faster execution:

def _count_internal(needle, haystack):
    try:
        item = next(haystack)
    except StopIteration:
        return 0

    return (item == needle) + _count_internal(needle, haystack)

def count(needle, haystack):
    return _count_internal(needle, iter(haystack))

Both of these approaches have the advantage that they don't use (much) additional memory and can avoid the copies. So it should be faster and take less memory.
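One consequence of the iterator variants that is not spelled out above: since they never index or slice, they should also accept inputs that are not sequences at all. The sketch below repeats the first iterator version and feeds it a generator and a dict (both inputs are my own examples).

```python
def count(needle, haystack):
    haystack = iter(haystack)
    try:
        item = next(haystack)
    except StopIteration:
        return 0
    return (item == needle) + count(needle, haystack)

letters = (ch for ch in "banana")  # a generator: no len(), no indexing
print(count("a", letters))         # 3
print(count(2, {1: "x", 2: "y"}))  # 1 (iterating a dict yields its keys)
```

Note that a generator is consumed by the call, so counting a second item would need a fresh generator.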

However, for long sequences you will run into problems because of the recursion. Python has a recursion limit (which is adjustable, but only to some extent):

>>> count('a', 'a'*10000)
---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-9-098dac093433> in <module>()
----> 1 count('a', 'a'*10000)

<ipython-input-5-5eb7a3fe48e8> in count(needle, haystack)
     11     else:
     12         add = 0
---> 13     return add + count(needle, haystack)

... last 1 frames repeated, from the frame below ...

<ipython-input-5-5eb7a3fe48e8> in count(needle, haystack)
     11     else:
     12         add = 0
---> 13     return add + count(needle, haystack)

RecursionError: maximum recursion depth exceeded in comparison
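If you want to reproduce this without a long traceback, something like the following should work. It is a sketch that uses the index-based variant from above and sizes the input relative to sys.getrecursionlimit() instead of hard-coding 10000; the factors 4 and 2 are arbitrary choices.

```python
import sys

def count(needle, haystack, current_index=0):
    if current_index >= len(haystack):
        return 0
    return (haystack[current_index] == needle) + count(needle, haystack, current_index + 1)

limit = sys.getrecursionlimit()  # typically 1000 in CPython

# Comfortably below the limit: works.
print(count('a', 'a' * (limit // 4)))

# One call per element, so twice the limit cannot work.
try:
    count('a', 'a' * (limit * 2))
except RecursionError as exc:
    print("too deep:", exc)
```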

Recursion using divide-and-conquer

There are ways to mitigate that problem (you cannot fully solve the recursion depth problem as long as you use recursion). An approach used regularly is divide-and-conquer: you divide the sequence into two (sometimes more) parts and call the function on each of these parts. The recursion still ends when only one item remains:

def count(needle, haystack):
    length = len(haystack)
    # No item
    if length == 0:
        return 0
    # Only one item remained
    if length == 1:
        # I used the long version here to avoid returning True/False for
        # length-1 sequences
        if needle == haystack[0]:
            return 1
        else:
            return 0

    # More than one item, split the sequence in
    # two parts and recurse on each of them
    mid = length // 2
    return count(needle, haystack[:mid]) + count(needle, haystack[mid:])

The recursion depth now changes from n to log(n), which makes the call that previously failed work:

>>> count('a', 'a'*10000)
10000
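To check the log(n) claim, one can thread a depth counter through the same divide-and-conquer function. The variant below is an instrumented sketch (the name count_with_depth and the extra depth parameter are additions for this illustration only).

```python
def count_with_depth(needle, haystack, depth=1):
    """Return (occurrences, deepest recursion level reached)."""
    length = len(haystack)
    if length == 0:
        return 0, depth
    if length == 1:
        return (1 if needle == haystack[0] else 0), depth
    mid = length // 2
    left, left_depth = count_with_depth(needle, haystack[:mid], depth + 1)
    right, right_depth = count_with_depth(needle, haystack[mid:], depth + 1)
    return left + right, max(left_depth, right_depth)

total, deepest = count_with_depth('a', 'a' * 10000)
print(total, deepest)  # 10000 15
```

Fifteen nested calls for 10000 items is far below the default recursion limit, while the one-item-per-call versions need 10000.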

However because I used slicing it will again create lots of copies. Using iterators will be complicated (or impossible) because iterators don't have a size (generally) but it's easy to use indices:

def _count_internal(needle, haystack, start_index, end_index):
    length = end_index - start_index
    if length == 0:
        return 0
    if length == 1:
        if needle == haystack[start_index]:
            return 1
        else:
            return 0

    mid = start_index + length // 2
    res1 = _count_internal(needle, haystack, start_index, mid)
    res2 = _count_internal(needle, haystack, mid, end_index)
    return res1 + res2

def count(needle, haystack):
    return _count_internal(needle, haystack, 0, len(haystack))

Using built-in methods with recursion

It may seem stupid to use built-in methods (or functions) in this case because there is already a built-in method to solve the problem without recursion but here it is and it uses the index method that both strings and lists have:

def count(needle, haystack):
    try:
        next_index = haystack.index(needle)
    except ValueError:  # the needle isn't present
        return 0

    return 1 + count(needle, haystack[next_index+1:])
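The slice in the last line copies the remainder of the sequence on every match. Both str.index and list.index accept an optional start position, so a variant without copies could look like the sketch below (the start parameter is my addition, not part of the version above).

```python
def count(needle, haystack, start=0):
    try:
        # Search only from position `start` onwards; no slice is created.
        next_index = haystack.index(needle, start)
    except ValueError:  # the needle isn't present in the remainder
        return 0
    return 1 + count(needle, haystack, next_index + 1)

print(count('i', 'what is it'))  # 2
print(count(1, [1, 2, 1, 1]))    # 3
```

The recursion depth here is the number of matches plus one, so a sequence with very many matches can still hit the recursion limit.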

Using iteration instead of recursion

Recursion is really powerful, but in Python you have to fight against the recursion limit, and because Python has no tail call optimization it is often rather slow. This can be solved by using iteration instead of recursion:

def count(needle, haystack):
    found = 0
    for item in haystack:
        if needle == item:
            found += 1
    return found

Iterative approaches using built-ins

If you're more adventurous, you can also use a generator expression together with sum:

def count(needle, haystack):
    return sum(needle == item for item in haystack)

Again this relies on the fact that booleans behave like integers and so sum adds all the occurrences (ones) with all non-occurrences (zeros) and thus gives the number of total counts.

But if one is already using built-ins it would be a shame not to mention the built-in method (that both strings and lists have): count :

def count(needle, haystack):
    return haystack.count(needle)

At that point you probably don't need to wrap it inside a function anymore and could simply use just the method directly.
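Used directly, that looks like the lines below. The last two lines point out one difference worth knowing: for strings, the count method counts non-overlapping substrings, while the element-wise versions in this answer compare against single characters only.

```python
print('what is it'.count('i'))  # 2
print([1, 2, 1, 3].count(1))    # 2

# str.count accepts multi-character needles (non-overlapping matches) ...
print('aaaa'.count('aa'))                  # 2
# ... whereas an element-wise comparison never matches a 2-character needle.
print(sum('aa' == ch for ch in 'aaaa'))    # 0
```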

In case you want to go even further and count all elements, you can use Counter from the collections module in the standard library:

>>> from collections import Counter
>>> Counter('abcdab')
Counter({'a': 2, 'b': 2, 'c': 1, 'd': 1})
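Looking up a single item in such a Counter is a matter of indexing it. A small sketch: indexing returns 0 for items that never occurred, whereas the dict method get returns None for them.

```python
from collections import Counter

counts = Counter('abcdab')
print(counts['a'])      # 2
print(counts['z'])      # 0: missing items count as zero
print(counts.get('z'))  # None: plain dict lookup semantics
```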

Performance

I mentioned copies and their effect on memory and performance several times, so I want to present some quantitative results to show that it actually makes a difference.

I used a fun project of mine, simple_benchmark, here (it's a third-party package, so you have to install it if you want to run this):

def count_original(f, s):
    if not s: 
        return 0
    elif f == s[0]:
        return 1 + count_original(f, s[1:])
    else:
        return 0 + count_original(f, s[1:])


def _count_index_internal(needle, haystack, current_index):
    length = len(haystack)
    if current_index >= length:
        return 0
    found = haystack[current_index] == needle
    return found + _count_index_internal(needle, haystack, current_index + 1)

def count_index(needle, haystack):
    return _count_index_internal(needle, haystack, 0)


def _count_iterator_internal(needle, haystack):
    try:
        item = next(haystack)
    except StopIteration:
        return 0

    return (item == needle) + _count_iterator_internal(needle, haystack)

def count_iterator(needle, haystack):
    return _count_iterator_internal(needle, iter(haystack))


def count_divide_conquer(needle, haystack):
    length = len(haystack)
    if length == 0:
        return 0
    if length == 1:
        if needle == haystack[0]:
            return 1
        else:
            return 0
    mid = length // 2
    return count_divide_conquer(needle, haystack[:mid]) + count_divide_conquer(needle, haystack[mid:])


def _count_divide_conquer_index_internal(needle, haystack, start_index, end_index):
    length = end_index - start_index
    if length == 0:
        return 0
    if length == 1:
        if needle == haystack[start_index]:
            return 1
        else:
            return 0

    mid = start_index + length // 2
    res1 = _count_divide_conquer_index_internal(needle, haystack, start_index, mid)
    res2 = _count_divide_conquer_index_internal(needle, haystack, mid, end_index)
    return res1 + res2

def count_divide_conquer_index(needle, haystack):
    return _count_divide_conquer_index_internal(needle, haystack, 0, len(haystack))


def count_index_method(needle, haystack):
    try:
        next_index = haystack.index(needle)
    except ValueError:  # the needle isn't present
        return 0

    return 1 + count_index_method(needle, haystack[next_index+1:])


def count_loop(needle, haystack):
    found = 0
    for item in haystack:
        if needle == item:
            found += 1
    return found


def count_sum(needle, haystack):
    return sum(needle == item for item in haystack)


def count_method(needle, haystack):
    return haystack.count(needle)

import random
import string
from functools import partial
from simple_benchmark import benchmark, MultiArgument

funcs = [count_divide_conquer, count_divide_conquer_index, count_index, count_index_method, count_iterator, count_loop,
         count_method, count_original, count_sum]
# Only recursive approaches without builtins
# funcs = [count_divide_conquer, count_divide_conquer_index, count_index, count_iterator, count_original]
arguments = {
    2**i: MultiArgument(('a', [random.choice(string.ascii_lowercase) for _ in range(2**i)]))
    for i in range(1, 12)
}
b = benchmark(funcs, arguments, 'size')

b.plot()

[Image: benchmark plot]

It's log-log scaled to display the range of values in a meaningful way and lower means faster.

One can clearly see that the original approach gets very slow for long inputs (because it copies the list it performs in O(n**2) ) while the other approaches behave linearly. What may seem weird is that the divide-and-conquer approaches perform slower, but that is because these need more function calls (and function calls are expensive in Python). However they can process much longer inputs than the iterator and index variants before they hit the recursion limit.

It would be easy to change the divide-and-conquer approach so that it runs faster, a few possibilities that come to mind:

  • Switch to non-divide-and-conquer when the sequence is short.
  • Always process one element per function call and only divide the rest of the sequence.

But given that this is probably just an exercise in recursion that goes a bit beyond the scope.
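For the curious, the first idea could be sketched roughly as follows. This is an untuned illustration, not a benchmarked implementation: the names count_hybrid, _count_linear and _count_hybrid as well as the cutoff value SHORT_CUTOFF are hypothetical, and short ranges simply fall back to the one-item-per-call index recursion.

```python
SHORT_CUTOFF = 32  # hypothetical threshold, not tuned

def _count_linear(needle, haystack, start, end):
    # Plain one-item-per-call recursion, only used for short ranges.
    if start >= end:
        return 0
    return (haystack[start] == needle) + _count_linear(needle, haystack, start + 1, end)

def _count_hybrid(needle, haystack, start, end):
    if end - start <= SHORT_CUTOFF:
        return _count_linear(needle, haystack, start, end)
    # Long range: split in two halves, as in the divide-and-conquer version.
    mid = start + (end - start) // 2
    return (_count_hybrid(needle, haystack, start, mid)
            + _count_hybrid(needle, haystack, mid, end))

def count_hybrid(needle, haystack):
    return _count_hybrid(needle, haystack, 0, len(haystack))

print(count_hybrid('a', 'a' * 10000))  # 10000
```

The recursion depth is then roughly the cutoff plus the logarithm of the number of chunks, so the cutoff has to stay well below the recursion limit.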

However they all perform much worse than using iterative approaches:

[Image: benchmark plot including the iterative approaches]

In particular, the count method of lists (and also the one of strings) and the manual loop are much faster.
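If you don't want to install a third-party package, a rough comparison of the iterative variants is also possible with timeit from the standard library. This is a sketch with arbitrary choices (2000 items, 50 calls, best of 3), not the setup used for the plots above, and the absolute numbers will differ between machines.

```python
import random
import string
import timeit

def count_loop(needle, haystack):
    found = 0
    for item in haystack:
        if needle == item:
            found += 1
    return found

def count_sum(needle, haystack):
    return sum(needle == item for item in haystack)

def count_method(needle, haystack):
    return haystack.count(needle)

haystack = [random.choice(string.ascii_lowercase) for _ in range(2000)]

timings = {}
for func in (count_loop, count_sum, count_method):
    # Best of 3 runs, each run calling the function 50 times.
    timings[func.__name__] = min(
        timeit.repeat(lambda: func('a', haystack), number=50, repeat=3)
    )

for name, seconds in sorted(timings.items(), key=lambda pair: pair[1]):
    print(f"{name}: {seconds:.6f} s for 50 calls")
```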

The error occurs because sometimes your function has no return value. Adding return 0 at the end of your function fixes this error. There are much better ways to do this in Python, but I assume this is just practice in recursive programming.
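Applied to the function from the question, that fix might look like this (a sketch; only the final return line is new):

```python
def Count(f, s):
    if s == []:
        return 0
    while len(s) != 0:
        if f == s[0]:
            return 1 + Count(f, s[1:])
        else:
            return 0 + Count(f, s[1:])
    # Reached for empty sequences that are not lists, such as "".
    return 0

print(Count('i', 'what is it'))  # 2
print(Count(1, [1, 2, 1]))       # 2
```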

You are doing things the hard way in my opinion.

You can use Counter from collections to do the same thing.

from collections import Counter

def count(f, s):
    if s is None:
        return 0
    # Indexing a Counter returns 0 for missing items; .get(f) would return None.
    return Counter(s)[f]

Counter builds a dict subclass that holds the counts of everything in your s object. Indexing it with f returns the count for the specific item you are searching for, or 0 if the item never occurs. This works on lists of numbers or a string.

If you're bound and determined to do it with recursion, whenever possible I strongly recommend halving the problem rather than whittling it down one-by-one. Halving allows you to deal with much larger cases without running into stack overflow.

def count(f, s):
    l = len(s)
    if l > 1:
        mid = l // 2  # floor division: a float from l / 2 cannot be used as a slice index in Python 3
        return count(f, s[:mid]) + count(f, s[mid:])
    elif l == 1 and s[0] == f:
        return 1
    return 0
