Is there a faster way to find repeated patterns in a list?

Python novice here. I have a problem in which I want to find all of the repeated patterns within a list (in my case, specifically, a list of integers). For example, given the list [2,1,4,3,12,8,3,3,4,16,2,9,9,8,3,3,4,1,4,3,4,8,3,3,4] and a minimum pattern length of 3, the algorithm would find that [8,3,3,4] occurs three times and [1,4,3] occurs twice (it would also be nice to have the index of every occurrence).

I have some code that works, if a little clumsily, but the lists that I eventually want to use it on may be very large. I'm not really sure how to work out the computational complexity of my code, but I know that it definitely gets very slow when I am using large lists.

My question is: does anyone know of a better algorithm for doing this, and/or am I doing this in a very inefficient way? Thanks for any help you can give me.

Here is the code:

# Searches list to determine how many times small list is included in big list
def contains(small, big):
    counter = 0
    # initiating list of indexes. N.B. indexlist gives LAST index of sequence, not first
    indexlist = []
    for i in range(len(big)-len(small)+1):
        for j in range(len(small)):
            if big[i+j] != small[j]:
                break
        else:
            counter += 1
            indexlist.append(i+j)
    if counter > 0:
        return counter, indexlist
    return False

def findrepeats(sequence, n_letters):
    fulldict = {}
    # Iterating through all the short-sequences of n letters in the list
    for i in range(0, len(sequence) - n_letters):
        shortlist = sequence[i:i + n_letters]
        # Dict key built from the numbers, e.g. ".8.3.3.4"
        shortliststr = "".join("." + str(number) for number in shortlist)
        # Scan the full sequence once and reuse the result
        result = contains(shortlist, sequence)
        # If short-sequence is found in full sequence more than once (i.e. itself), add to dict
        if result and result[0] > 1:
            fulldict[shortliststr] = result
    return fulldict

def findallrepeats(sequence, min_letters, max_letters):
    fulldict = {}
    # Iterating through all possible n_letters in findrepeats() between given range
    for i in range(min_letters, max_letters):
        newdict = findrepeats(sequence, i)
        fulldict.update(newdict)
    return fulldict

With overlapping

You can use a sliding window of size n = 3 that moves over your sequence one element at a time, and count the number of occurrences of each window.
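As a rough illustration of the idea using only the standard library, here is a minimal sketch. The helper name `count_windows` is made up for this example, and the short input list is just demo data:

```python
from collections import Counter


def count_windows(sequence, size):
    """Count every contiguous window of `size` items (hypothetical helper)."""
    # Zipping `size` shifted copies of the sequence yields the
    # overlapping windows as tuples, which are hashable.
    windows = zip(*(sequence[i:] for i in range(size)))
    return Counter(windows)


counts = count_windows([1, 2, 1, 2, 1], 2)
# Keep only the windows that occur more than once.
repeated = {window: n for window, n in counts.items() if n > 1}
print(repeated)
```

The windows are converted to tuples because lists cannot be used as dictionary keys.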

Using more_itertools.

For instance:

import collections
import more_itertools

sequence = [
    2, 1, 4, 3, 12, 8, 3, 3, 4, 16, 2, 9, 9,
    8, 3, 3, 4, 1, 4, 3, 4, 8, 3, 3, 4,
]
size = 3
windows = [
    tuple(window)
    for window in more_itertools.windowed(sequence, size)
]
counter = collections.Counter(windows)
for window, count in counter.items():
    if count > 1:
        print(window, count)

You get:

(1, 4, 3) 2
(8, 3, 3) 3
(3, 3, 4) 3
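The question also asks for the index of each occurrence and for every length from a minimum upward. One possible extension, sketched here with only the standard library, is to store start indexes in a dictionary instead of counting. The function name `find_repeats` and its parameters are assumptions for this example, not part of the code above:

```python
from collections import defaultdict


def find_repeats(sequence, min_len, max_len):
    """Map each repeated window (as a tuple) to its list of start indexes.

    Hypothetical helper; max_len is inclusive here.
    """
    repeats = {}
    for size in range(min_len, max_len + 1):
        positions = defaultdict(list)
        for start in range(len(sequence) - size + 1):
            positions[tuple(sequence[start:start + size])].append(start)
        found = {w: idx for w, idx in positions.items() if len(idx) > 1}
        if not found:
            # A repeated window of length size + 1 would contain a repeated
            # window of length size, so longer sizes cannot match either.
            break
        repeats.update(found)
    return repeats


sequence = [
    2, 1, 4, 3, 12, 8, 3, 3, 4, 16, 2, 9, 9,
    8, 3, 3, 4, 1, 4, 3, 4, 8, 3, 3, 4,
]
result = find_repeats(sequence, 3, 6)
for window, starts in result.items():
    print(window, len(starts), starts)
```

Each window size needs a single pass over the list, so this should scale much better than rescanning the whole list for every window. Note that it reports sub-patterns too: (8, 3, 3) and (3, 3, 4) appear alongside (8, 3, 3, 4), so you may want to filter those out afterwards.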
