
Counting the number of different 5-character substrings in a string

Given a string, I want to count how many substrings of length 5 it contains.

For example: Input: "ABCDEFG" Output: 3

I'm not sure what the easiest and fastest way to do this in Python would be. Any ideas?

Update:

I only want to count distinct substrings.

Input: "AAAAAA" Substrings: 2 times "AAAAA" Output: 1

>>> n = 5
>>> for s in 'ABCDEF', 'AAAAAA':
...     len({s[i:i+n] for i in range(len(s)-n+1)})
... 
2
1
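The same idea can be wrapped in a reusable function. This is only a minimal sketch: the name count_distinct_substrings and the decision to return 0 for a non-positive length are assumptions, not part of the answer above.

```python
def count_distinct_substrings(s, n):
    """Return the number of distinct substrings of s with length n."""
    if n <= 0:
        # Assumption: a non-positive length is treated as "no substrings".
        return 0
    # range() is empty when n > len(s), so the result is 0 in that case.
    return len({s[i:i + n] for i in range(len(s) - n + 1)})


print(count_distinct_substrings("ABCDEF", 5))  # 2
print(count_distinct_substrings("AAAAAA", 5))  # 1
print(count_distinct_substrings("ABC", 5))     # 0
```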

To get the substrings, you could use NLTK like this:

>>> from nltk.util import ngrams
>>> for gram in ngrams("ABCDEFG", 5):
...     print(gram)
... 
('A', 'B', 'C', 'D', 'E')
('B', 'C', 'D', 'E', 'F')
('C', 'D', 'E', 'F', 'G')

You could apply a Counter and then get the unique n-grams (and their frequency) like so:

>>> from collections import Counter
>>> Counter(ngrams("AAAAAAA", 5))
Counter({('A', 'A', 'A', 'A', 'A'): 3})
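If installing NLTK just for this feels heavy, a similar tally can probably be had with plain slicing and collections.Counter. The helper name substring_counts below is made up for illustration, and the sketch assumes n is at least 1.

```python
from collections import Counter


def substring_counts(s, n):
    """Map each length-n substring of s to its number of occurrences."""
    # Assumes n >= 1; the range is empty when n > len(s).
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


counts = substring_counts("AAAAAAA", 5)
print(counts)       # Counter({'AAAAA': 3})
print(len(counts))  # 1 distinct substring
```

The keys here are strings rather than tuples of characters, so len(counts) gives the number of distinct substrings directly.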

Using a list comprehension (code golf):

findSubs=lambda s,v:[''.join([s[i+j] for j in range(v)]) for i,x in enumerate(s) if i<=len(s)-v]
findCount=lambda s,v:len(findSubs(s,v))

print(findSubs('ABCDEFG', 5))   # prints ['ABCDE', 'BCDEF', 'CDEFG']
print(findCount('ABCDEFG', 5))  # prints 3

Update

For your update, you could cast the list above to a set, back to a list, then sort the strings.

findUnique=lambda s,v:sorted(list(set(findSubs(s,v))))
findUniqueCount=lambda s,v:len(findUnique(s,v))

print(findUnique('AAAAAA', 5))       # prints ['AAAAA']
print(findUniqueCount('AAAAAA', 5))  # prints 1

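If repeated substrings are counted too, it is just the length minus 4:

def substrings(s):
    return max(0, len(s) - 4)

This holds because every character from the first through the fifth-to-last can serve as the first letter of exactly one substring. The max guards against strings shorter than 5 characters, for which the subtraction alone would be wrong.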

A general solution may be:

def count(string, nletters):
  return max(0, len(string) - nletters + 1)

Applied to your example:

print(count("ABCDEFG", 5))  # prints 3

>>> how_much = lambda string, length: max(len(string) - length + 1, 0)
>>> how_much("ABCDEFG", 5)
3
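Note that len(s) - n + 1 counts every window, repeats included, so it answers the original question rather than the update. A short sketch contrasting the two counts; both function names are hypothetical.

```python
def total_windows(s, n):
    """Number of length-n substrings of s, duplicates included."""
    return max(0, len(s) - n + 1)


def distinct_windows(s, n):
    """Number of different length-n substrings of s (assumes n >= 1)."""
    return len({s[i:i + n] for i in range(len(s) - n + 1)})


for s in ("ABCDEFG", "AAAAAA"):
    print(s, total_windows(s, 5), distinct_windows(s, 5))
# ABCDEFG 3 3
# AAAAAA 2 1
```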

I'm pretty sure Python is not a good language for this, but if the substring length is large (say 1000 rather than 5) and the main string is very long, a linear-time solution is to build a suffix tree; there is plenty of material about them online. A suffix tree for a string of length n can be built in O(n) time, and traversing it also takes O(n) time. By traversing the upper levels of the tree you can count all distinct substrings of a particular length, again in O(n) time, regardless of the substring length you want.
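A full suffix tree is fairly long to write in Python. As a sketch of the same idea, the code below swaps in a suffix automaton, a related structure that the answer above does not mention. It relies on the property that each non-root state v accounts for exactly one distinct substring of every length in the range (length[link[v]], length[v]]. Construction should be roughly linear in the string length for a small alphabet; all names are illustrative.

```python
def count_distinct_by_automaton(s, k):
    """Count distinct substrings of s with length k via a suffix automaton."""
    if k <= 0 or k > len(s):
        return 0
    # Parallel lists describe the states; state 0 is the root.
    length = [0]   # length of the longest substring ending in each state
    link = [-1]    # suffix link of each state
    trans = [{}]   # outgoing transitions of each state
    last = 0
    for ch in s:
        cur = len(length)
        length.append(length[last] + 1)
        link.append(-1)
        trans.append({})
        p = last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter length.
                clone = len(length)
                length.append(length[p] + 1)
                link.append(link[q])
                trans.append(dict(trans[q]))
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    # A state contributes one substring of length k when k falls in its range.
    return sum(1 for v in range(1, len(length))
               if length[link[v]] < k <= length[v])


print(count_distinct_by_automaton("ABCDEFG", 5))  # 3
print(count_distinct_by_automaton("AAAAAA", 5))   # 1
```

For short substrings such as length 5, a set of slices is much simpler; this approach is only likely to pay off when k is large, because it never materialises the substrings themselves.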


 