
Find all the sequences of elements that occurred more than N times in a list of nested lists

I have a list of lists of lists, x , and its first two sublists are shown below:

x[0] = [['a', 'b', 'c', 'd'],
        ['e', 'f', 'g', 'a'],
        ['d', 'c', 'f'],
        ['e', 'g'],
       ]

x[1] = [['a', 'b'],
        ['a', 'f', 'g', 'k'],
        ['e', 'd', 'f'],
       ]

I want to find all sequences of elements that appear in consecutive sub-sub-lists (one element from each) and occur at least N times overall in x. In this case, for 3-element sequences with N = 2 occurrences, the result would be: ['a', 'f', 'f'], ['b', 'f', 'f'], ['a', 'f', 'd'], ['b', 'f', 'd'], ['a', 'a', 'd'].

If I also want 2-element sequences with N = 2 occurrences, then ['a', 'f'], ['b', 'f'], ['f', 'f'], ['f', 'd'], ['a', 'a'], ['a', 'g'], ['b', 'g'], ['b', 'a'], ['g', 'f'], ['a', 'd'], ['f', 'e'] would be added to the final output.

Is there an efficient way to achieve this and to generalise it to the whole x, which has more than 100k sublists? Thanks.

I'm still a bit unclear about the exact rules but here's an idea which might be a starting point:

from collections import Counter
from itertools import product, chain

def count_seqs(lol, n):
    # Count every n-tuple formed by picking one element from each of
    # n consecutive sub-sub-lists (a sliding window of width n over lol)
    return Counter(chain(*(
        product(*lol[i:i + n]) for i in range(len(lol) - n + 1)
    )))

count_all = sum((count_seqs(lol, 3) for lol in x), Counter())
N = 2
result = {seq for seq, count in count_all.items() if count >= N}

But: (1) I'm unsure whether that produces the results you're looking for, and (2) I don't know how it performs.
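For what it's worth, here is a self-contained run on just the two sample sublists from the question (everything is repeated so the snippet runs on its own). Under this product-over-consecutive-sublists interpretation it does find the five expected 3-sequences, but also a few further tuples that occur twice, e.g. ('a', 'a', 'f') — which may or may not match your intended rules:

```python
from collections import Counter
from itertools import product, chain

# The two sample sublists from the question
x = [
    [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'a'], ['d', 'c', 'f'], ['e', 'g']],
    [['a', 'b'], ['a', 'f', 'g', 'k'], ['e', 'd', 'f']],
]

def count_seqs(lol, n):
    # All n-tuples taking one element from each of n consecutive sublists
    return Counter(chain(*(
        product(*lol[i:i + n]) for i in range(len(lol) - n + 1)
    )))

count_all = sum((count_seqs(lol, 3) for lol in x), Counter())
result = {seq for seq, count in count_all.items() if count >= 2}

expected = {('a', 'f', 'f'), ('b', 'f', 'f'), ('a', 'f', 'd'),
            ('b', 'f', 'd'), ('a', 'a', 'd')}
print(expected <= result)         # → True: the five sequences from the question
print(('a', 'a', 'f') in result)  # → True: but this one also occurs twice
```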

EDIT: To inspect the sequences in x:

def seqs(x, n):
    return [
        sum((list(product(*lol[i:i + n])) for i in range(len(lol) - n + 1)), [])
        for lol in x
    ]

print(seqs(x, 3))

EDIT 2: A technical speed-up via multiprocessing:

from collections import Counter
from itertools import product, chain
from multiprocessing import Pool
from time import perf_counter  # Only for timing

# As before - change if necessary
def count_seqs(lol, n):
    return Counter(chain(*(
        product(*lol[i:i + n]) for i in range(len(lol) - n + 1)
    )))

# Function for summing over counts of a piece of x (a "chunk")
def count_chunk(x, n):
    return sum((count_seqs(lol, n) for lol in x), Counter())

if __name__ == '__main__':

    n = 4  # Length of the sequences

    start = perf_counter()  # Only for timing
    count_all = count_chunk(x, n)  # Essentially the "classic" way
    end = perf_counter()  # Only for timing
    print('Classic:', end - start)  # Only for timing

    start = perf_counter()  # Only for timing
    k = 500  # Size of a "chunk"
    with Pool() as p:  # Multiprocessing using Pool and starmap
        # Counting over the chunks in several processes
        counts = p.starmap(count_chunk,
                           ((x[i:i+k], n) for i in range(0, len(x), k)))
    count_all = sum(counts, Counter())  # Aggregating over the chunk counts
    end = perf_counter()  # Only for timing
    print('Multiprocessing:', end - start)  # Only for timing

I ran that with the following sample x:

import random

elements = 'abcdefghijklmnopqrstuvwxyz'
x = [[list(random.sample(elements, random.randint(2, 10)))
      for _ in range(random.randint(5, 50))]
     for _ in range(10_000)]

and got these results:

Classic: 2755.06
Multiprocessing: 733.94

Better, but not very encouraging for an x of length 100,000 ...
