Find all Occurrences of Every Substring in String

Question

I am trying to find all occurrences of substrings of every length in a main string. My function takes one string and returns a dictionary of every substring that occurs more than once, mapped to the number of times it occurs (format of the dictionary: {substring: # of occurrences, ...} ). I am using collections.Counter to help me with it.

Here is my function:

from collections import Counter

def patternFind(s):
    patterns = {}
    for index in range(1, len(s)+1)[::-1]:
        d = nChunks(s, step=index)
        parts = dict(Counter(d))
        patterns.update({elem: parts[elem] for elem in parts.keys() if parts[elem] > 1})
    return patterns

def nChunks(iterable, start=0, step=1):
    return [iterable[i:i+step] for i in range(start, len(iterable), step)]

I have a string, data, with about 2500 random letters (in a random order). However, two copies of a known string are inserted into it at random points. Say this string is 'TEST'. data.count('TEST') returns 2. However, patternFind(data)['TEST'] raises a KeyError. Therefore, my program does not detect the two inserted strings.

What have I done wrong? Thanks!

Edit: My method of creating testing-instances:

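from random import randint, choice
from string import ascii_uppercase as uppercase  # string.uppercase was removed in Python 3

def createNewTest():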
    n = randint(500, 2500)
    x, y = randint(500, n), randint(500, n)
    s = ''
    for i in range(n):
        s += choice(uppercase)
        if i == x or i == y: s += "TEST"
    return s

Answer 1

Using Regular Expressions

Apart from the count() method you described, regex is an obvious alternative:

import re

needle = r'TEST'

haystack = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklagh'
pattern = re.compile(needle)

print(len(re.findall(pattern, haystack)))
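
One caveat worth noting: re.findall() only reports non-overlapping matches, so a needle that can overlap itself is undercounted. Below is a minimal sketch of one way around this, wrapping the needle in a zero-width lookahead. The helper name count_overlapping is made up for this example.

```python
import re

def count_overlapping(needle, haystack):
    """Count occurrences of a literal needle, including overlapping ones."""
    # A lookahead matches without consuming characters, so the scan
    # advances one position at a time instead of skipping past each match.
    pattern = re.compile('(?=' + re.escape(needle) + ')')
    return sum(1 for _ in pattern.finditer(haystack))

print(len(re.findall('aa', 'aaaa')))    # non-overlapping count
print(count_overlapping('aa', 'aaaa'))  # overlapping count
```

For a needle like 'TEST', which cannot overlap itself, both approaches give the same number.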

Short Cut

If you need to build a dictionary of substrings, you may be able to do it with only a subset of them. Assuming you know the needle you are looking for, you only need the substrings of data that have the same length as needle. This is very fast.

from collections import Counter

needle = "TEST"

def gen_sub(s, len_chunk):
    for start in range(0, len(s)-len_chunk+1):
        yield s[start:start+len_chunk]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
parts = Counter(gen_sub(data, len(needle)))

print(parts[needle])

Brute Force: building dictionary of all substrings

If you need a count of all possible substrings, this works, but it is very slow, because a string of length n has n(n+1)/2 substrings:

from collections import Counter

def gen_sub(s):
    for start in range(0, len(s)):
        for end in range(start+1, len(s)+1):
            yield s[start:end]

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhz'
parts = Counter(gen_sub(data))

print(parts['TEST'])

Substring generator adapted from this: https://stackoverflow.com/a/8305463/1290420
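
Since the question asks only for substrings that occur more than once, the Counter can be filtered afterwards. Here is a small sketch in Python 3 syntax, repeating the generator so that it runs on its own. The name repeated_substrings is just for illustration.

```python
from collections import Counter

def gen_sub(s):
    # Yield every contiguous substring s[start:end].
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            yield s[start:end]

def repeated_substrings(s):
    """Map each substring that occurs at least twice to its count."""
    counts = Counter(gen_sub(s))
    return {sub: n for sub, n in counts.items() if n > 1}

print(repeated_substrings('1test2345test'))
```

Because every window is generated, overlapping occurrences are counted as well.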

Answer 2

While jurgenreza has explained why your program didn't work, the solution is still quite slow. If you only examine substrings s for which you know that s[:-1] repeats, you get a much faster solution (typically a hundred times faster or more):

from collections import defaultdict

def pfind(prefix, sequences):
    collector = defaultdict(list)
    for sequence in sequences:
        collector[sequence[0]].append(sequence)
    for item, matching_sequences in collector.items():
        if len(matching_sequences) >= 2:
            new_prefix = prefix + item
            yield (new_prefix, len(matching_sequences))
            for r in pfind(new_prefix, [sequence[1:] for sequence in matching_sequences]):
                yield r

def find_repeated_substrings(s):
    s0 = s + " "
    return pfind("", [s0[i:] for i in range(len(s))])

If you want a dict, you call it like this:

result = dict(find_repeated_substrings(s))

On my machine, for a run with 2247 elements, it took 0.02 sec, while the original (corrected) solution took 12.72 sec.

(Note that this is a rather naive implementation; using indexes instead of substrings should be even faster.)

Edit: The following variant works with other sequence types (not only strings). Also, it doesn't need a sentinel element.

from collections import defaultdict

def pfind(s, length, ends):
    collector = defaultdict(list)
    if ends[-1] >= len(s):
        del ends[-1]
    for end in ends:
        if end < len(s):
            collector[s[end]].append(end)
    for key, matching_ends in collector.items():
        if len(matching_ends) >= 2:
            end = matching_ends[0]
            yield (s[end - length: end + 1], len(matching_ends))
            for r in pfind(s, length + 1, [end + 1 for end in matching_ends if end < len(s)]):
                yield r


def find_repeated_substrings(s):
    return pfind(s, 0, list(range(len(s))))

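This still has the problem that very long repeated substrings will exceed the recursion depth, because the recursion goes one level deeper for each character of a repeated substring. You might want to catch the exception.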
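
The recursion limit can also be sidestepped by keeping the pending work on an explicit stack. The following is a sketch of that idea, not a tuned implementation. The function name find_repeated_substrings_iter and the (length, starts) stack layout are my own choices for this example, and it tracks start positions rather than end positions.

```python
from collections import defaultdict

def find_repeated_substrings_iter(s):
    """Yield (substring, count) for every substring occurring at least twice."""
    # Each stack entry is (length, starts): the positions where one repeated
    # substring of that length begins. The empty prefix starts everywhere.
    stack = [(0, list(range(len(s))))]
    while stack:
        length, starts = stack.pop()
        # Group the occurrences by the element that would extend them.
        groups = defaultdict(list)
        for start in starts:
            end = start + length
            if end < len(s):
                groups[s[end]].append(start)
        for group in groups.values():
            if len(group) >= 2:
                first = group[0]
                yield s[first:first + length + 1], len(group)
                # Only repeated substrings are worth extending further.
                stack.append((length + 1, group))

print(dict(find_repeated_substrings_iter('1test2345test')))
```

The order in which results are produced differs from the recursive version, which does not matter once they are collected into a dict.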

Answer 3

Here is a solution that uses a recursive wrapper around string.find() to search for all the occurrences of a substring in a main string. The collectallchuncks() function returns a defaultdict with all the substrings as keys and, for each substring, a list of all the indexes where the substring is found in the main string.

import collections

# Minimum substring size, may be 1
MINSIZE = 3

# Recursive wrapper
def recfind(p, data, pos, acc):
    res = data.find(p, pos)
    if res == -1:
        return acc
    else:
        acc.append(res)
        return recfind(p, data, res+1, acc)

def collectallchuncks(data):
    res = collections.defaultdict(list)  # values are lists of indexes
    size = len(data)
    for base in range(size):
        for seg in range(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            if data.count(chunk) > 1:
                res[chunk] = recfind(chunk, data, 0, [])
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks(data)
    print('TEST', allchuncks['TEST'])
    print('hklag', allchuncks['hklag'])
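
Note that recfind() recurses once per occurrence found, so a chunk that occurs more often than the interpreter's recursion limit allows (1000 frames by default in CPython) would raise an error. A loop avoids this. Here is a minimal sketch, where find_all is a hypothetical drop-in replacement for recfind(p, data, 0, []):

```python
def find_all(p, data):
    """Return the start indexes of all occurrences of p in data."""
    acc = []
    pos = data.find(p)
    while pos != -1:
        acc.append(pos)
        # Resume one character later so overlapping matches are found too.
        pos = data.find(p, pos + 1)
    return acc

data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
print('TEST', find_all('TEST', data))
```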

EDIT: If you just need the number of occurrences of each substring in the main string, you can easily obtain it by getting rid of the recursive function:

import collections

MINSIZE = 3

def collectallchuncks2(data):
    res = collections.defaultdict(int)  # values are counts
    size = len(data)
    for base in range(size):
        for seg in range(MINSIZE, size-base+1):
            chunk = data[base:base+seg]
            cnt = data.count(chunk)
            if cnt > 1:
                res[chunk] = cnt
    return res

if __name__ == "__main__":
    data = 'khjkzahklahjTESTkahklaghTESTjklajhkhzkhjkzahklahjTESTkahklaghTESz'
    allchuncks = collectallchuncks2(data)
    print('TEST', allchuncks['TEST'])
    print('hklag', allchuncks['hklag'])
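
One thing to keep in mind: str.count() counts non-overlapping occurrences, so a chunk that overlaps itself can be reported with a lower count, or be missed entirely by the "more than one" test. Here is a short sketch of a sliding-window count that includes overlaps. The helper count_with_overlaps is hypothetical and not part of the code above.

```python
def count_with_overlaps(chunk, data):
    """Count occurrences of chunk in data, overlapping ones included."""
    size = len(chunk)
    # Compare chunk against every window of the same length.
    return sum(1 for i in range(len(data) - size + 1)
               if data[i:i + size] == chunk)

print('aaaa'.count('aaa'))                 # non-overlapping count
print(count_with_overlaps('aaa', 'aaaa'))  # overlapping count
```

For random letter data with a marker such as 'TEST', the two counts normally agree, so this mainly matters for repetitive input.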

Answer 4

The problem is in your nChunks function. It does not give you all the chunks that are necessary.

Let's consider a test string:

s='1test2345test'

For the chunks of size 4 your nChunks function gives this output:

>>> nChunks(s, step=4)
['1tes', 't234', '5tes', 't']

But what you really want is:

>>> def nChunks(iterable, start=0, step=1):
...     return [iterable[i:i+step] for i in range(len(iterable)-step+1)]
...
>>> nChunks(s, step=4)
['1tes', 'test', 'est2', 'st23', 't234', '2345', '345t', '45te', '5tes', 'test']

You can see that this way there are two 'test' chunks and your patternFind(s) will work like a charm:

>>> patternFind(s)
{'tes': 2, 'st': 2, 'te': 2, 'e': 2, 't': 4, 'es': 2, 'est': 2, 'test': 2, 's': 2}
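
For reference, here is a sketch of the two functions put together with the corrected nChunks, written so that it also runs on Python 3. Dropping the now unused start parameter is my simplification, not part of the original code.

```python
from collections import Counter

def nChunks(iterable, step=1):
    # Slide a window of width `step` one position at a time.
    return [iterable[i:i + step] for i in range(len(iterable) - step + 1)]

def patternFind(s):
    patterns = {}
    # Try every window size, from the whole string down to one character.
    for size in range(len(s), 0, -1):
        counts = Counter(nChunks(s, step=size))
        patterns.update({sub: n for sub, n in counts.items() if n > 1})
    return patterns

print(patternFind('1test2345test')['test'])
```

As noted in the other answers, this examines every window of every size, so it gets slow for strings with a few thousand characters.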

solution1
4 2013-04-03 14:20:05

Using Regular Expressions

Short Cut

Brute Force: building dictionary of all substrings

solution2
3 ACCEPTED 2013-04-11 11:53:01

solution3
2 2013-04-06 00:11:10

solution4
2 2013-04-06 03:54:40
