I have a list of lists of lists, x, and its first two sublists are shown below:
x[0] = [['a', 'b', 'c', 'd'],
        ['e', 'f', 'g', 'a'],
        ['d', 'c', 'f'],
        ['e', 'g'],
       ]
x[1] = [['a', 'b'],
        ['a', 'f', 'g', 'k'],
        ['e', 'd', 'f'],
       ]
I want to find all sequences of elements drawn from consecutive sub-sub-lists that occur at least N times across the whole of x. In this case, with 3-consecutive elements and N = 2 occurrences, the result would be: ['a', 'f', 'f'], ['b', 'f', 'f'], ['a', 'f', 'd'], ['b', 'f', 'd'], ['a', 'a', 'd'].
If I also want 2-consecutive elements with N = 2 occurrences, then ['a', 'f'], ['b', 'f'], ['f', 'f'], ['f', 'd'], ['a', 'a'], ['a', 'g'], ['b', 'g'], ['b', 'a'], ['g', 'f'], ['a', 'd'], ['f', 'e'] would be added to the final output.
Is there an efficient way to achieve this that generalises to the whole of x, which has more than 100k sublists? Thanks.
I'm still a bit unclear about the exact rules, but here's an idea which might be a starting point:
from collections import Counter
from itertools import product, chain

def count_seqs(lol, n):
    # Count every n-long sequence built by taking one element from
    # each of n consecutive sub-sub-lists of lol
    return Counter(chain(*(
        product(*lol[i:i + n]) for i in range(len(lol) - n + 1)
    )))

# Aggregate the counts over all sublists of x
count_all = sum((count_seqs(lol, 3) for lol in x), Counter())

N = 2
result = {seq for seq, count in count_all.items() if count >= N}
But: (1) I'm unsure if that produces the results you're looking for, and (2) I don't know how that does performance-wise.
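As a quick sanity check, here's that first snippet run on the two sublists from the question (note that the sequences come out as tuples rather than lists, since that's what product yields):

```python
from collections import Counter
from itertools import product, chain

def count_seqs(lol, n):
    # Count every n-long sequence built by taking one element from
    # each of n consecutive sub-sub-lists of lol
    return Counter(chain(*(
        product(*lol[i:i + n]) for i in range(len(lol) - n + 1)
    )))

# The two sublists given in the question
x = [
    [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'a'], ['d', 'c', 'f'], ['e', 'g']],
    [['a', 'b'], ['a', 'f', 'g', 'k'], ['e', 'd', 'f']],
]

N = 2
count_all = sum((count_seqs(lol, 2) for lol in x), Counter())
result = {seq for seq, count in count_all.items() if count >= N}
print(result)
```

For n = 2 and N = 2 this includes, e.g., ('a', 'f'), ('f', 'f') and ('a', 'a'), matching the pairs listed in the question (as tuples).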
EDIT: For inspecting the sequences in x:
def seqs(x, n):
    return [
        sum((list(product(*lol[i:i + n])) for i in range(len(lol) - n + 1)), [])
        for lol in x
    ]
print(seqs(x, 3))
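On the question's toy x this returns, per sublist, one flat list of all candidate sequences in window order. For example, with n = 2 the first entries come from pairing each of ['a', 'b', 'c', 'd'] with each of ['e', 'f', 'g', 'a']:

```python
from itertools import product

def seqs(x, n):
    # One flat list of candidate n-sequences per sublist of x
    return [
        sum((list(product(*lol[i:i + n])) for i in range(len(lol) - n + 1)), [])
        for lol in x
    ]

x = [
    [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'a'], ['d', 'c', 'f'], ['e', 'g']],
    [['a', 'b'], ['a', 'f', 'g', 'k'], ['e', 'd', 'f']],
]

pairs = seqs(x, 2)
print(pairs[0][:4])  # first window of x[0]: 'a' paired with 'e', 'f', 'g', 'a'
```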
EDIT 2: Technical speed increase via multiprocessing:
from collections import Counter
from itertools import product, chain
from multiprocessing import Pool
from time import perf_counter  # Only for timing

# As before - change if necessary
def count_seqs(lol, n):
    return Counter(chain(*(
        product(*lol[i:i + n]) for i in range(len(lol) - n + 1)
    )))

# Function for summing over counts of a piece of x (a "chunk")
def count_chunk(x, n):
    return sum((count_seqs(lol, n) for lol in x), Counter())

if __name__ == '__main__':
    n = 4  # Length of the sequences

    start = perf_counter()  # Only for timing
    count_all = count_chunk(x, n)  # Essentially the "classic" way
    end = perf_counter()  # Only for timing
    print('Classic:', end - start)  # Only for timing

    start = perf_counter()  # Only for timing
    k = 500  # Size of a "chunk"
    with Pool() as p:  # Multiprocessing using Pool and starmap
        # Counting over the chunks in several processes
        counts = p.starmap(count_chunk,
                           ((x[i:i + k], n) for i in range(0, len(x), k)))
    count_all = sum(counts, Counter())  # Aggregating over the chunk counts
    end = perf_counter()  # Only for timing
    print('Multiprocessing:', end - start)  # Only for timing
I ran that with the following sample x:
import random

elements = 'abcdefghijklmnopqrstuvwxyz'
x = [[list(random.sample(elements, random.randint(2, 10)))
      for _ in range(random.randint(5, 50))]
     for _ in range(10_000)]
and got these results:
Classic: 2755.06
Multiprocessing: 733.94
Better, but not very encouraging for an x of length 100,000 ...
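One more lever worth trying before (or alongside) multiprocessing: `sum(..., Counter())` merges the per-sublist Counters pairwise, re-walking the accumulated keys at every step. Keeping a single Counter and updating it in place avoids both that and the per-sublist intermediate Counters. A sketch (the function name `count_chunk_inplace` is mine; same counts as `count_chunk`, just a different aggregation):

```python
from collections import Counter
from itertools import product

def count_chunk_inplace(x, n):
    # Hypothetical replacement for count_chunk: one in-place accumulator
    # instead of building and repeatedly merging per-sublist Counters
    total = Counter()
    for lol in x:
        for i in range(len(lol) - n + 1):
            # Counter.update accepts any iterable of hashable items,
            # so the product object can be consumed directly
            total.update(product(*lol[i:i + n]))
    return total
```

I haven't benchmarked this at the 100k scale, but it drops a per-sublist allocation and the quadratic-ish merge cost, and it slots straight into the Pool.starmap call in place of count_chunk.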