
Find common patterns in a sequence of words

I have a large list of strings, in which a sequence of sounds is stored. For example:

strings = ['A','B','C','G','F','F','F','A',...,'F']

What I would like to do is perform a statistical analysis in which I would define the subsequence length and return a list (or a dictionary, which is probably more practical) that goes like this:

subsequence_length = 5
output = {['A','B','A','A','F']: 129, ['B','G','G','F','F']: 112, ...}

subsequence_length = 3
output = {['A','A','F']: 209, ['G','F','F']: 198, ...}

What I have tried so far is a sort of kernel that moves through the list in a loop, such as:

from collections import Counter
counts = Counter()
for i in range(0, len(strings) - subsequence_length + 1, subsequence_length):
    counts[tuple(strings[i:i + subsequence_length])] += 1  # count this window

I have struggled, however, to find a fast solution: when the initial list is very large (thousands of elements), this approach is not efficient. Is there any regex command (or something similar) that can achieve this? Thanks!

You could use the natural language processing toolkit nltk (install with pip install nltk) to achieve this:

import nltk

output = nltk.FreqDist(nltk.ngrams(strings, subsequence_length))

nltk.ngrams produces the (overlapping) sub-sequences of length subsequence_length, and nltk.FreqDist then builds a dictionary-like counter of those sub-sequences.
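For what it's worth, nltk.FreqDist behaves like a collections.Counter, so the result can be queried directly. A minimal sketch, using a made-up sample list for illustration (the names strings and subsequence_length follow the question):

import nltk

# Made-up sample data; substitute your real list of sounds.
strings = ['A', 'B', 'C', 'G', 'F', 'F', 'F', 'A', 'B', 'C', 'G', 'F']
subsequence_length = 3

# Count every overlapping sub-sequence of the given length.
output = nltk.FreqDist(nltk.ngrams(strings, subsequence_length))

# The keys are tuples, since lists are not hashable and cannot be dictionary keys.
print(output.most_common(2))    # the two most frequent sub-sequences with their counts
print(output[('A', 'B', 'C')])  # count of one specific sub-sequence, here 2

If pulling in nltk feels heavy, collections.Counter(zip(*(strings[i:] for i in range(subsequence_length)))) gives the same counts with the standard library alone.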
