
finding gappy sublists within a certain range

Recently I asked a question here where I wanted to find sublists within a larger list. I have a similar but slightly different question. Suppose I have this list:

 [['she', 'is', 'a', 'student'],
 ['she', 'is', 'a', 'lawer'],
 ['she', 'is', 'a', 'great', 'student'],
 ['i', 'am', 'a', 'teacher'],
 ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']] 

and I want to query it using matches = ['she', 'is', 'student'] , retrieving all the sublists that contain the elements of matches in the same order. The only difference from the linked question is that I want to add a range parameter to the find_gappy function so that it skips sublists in which the gap between consecutive matched elements exceeds the specified range. For instance, in the example above, I would like a function like this:

matches = ['she', 'is', 'student']
x = [i for i in x if find_gappy(i, matches, range=2)]

which would return:

[['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]

The last sublist doesn't show up because, in the sentence she is a very very exceptionally good student , the gap between is and student exceeds the range limit.

How can I write such a function?

Here is one way that also takes the order of the items in matches into consideration:

In [102]: def find_gappy(all_sets, matches, gap_range=2):
     ...:     zip_m = list(zip(matches, matches[1:]))
     ...:     for lst in all_sets:
     ...:         indices = {j: i for i, j in enumerate(lst)}
     ...:         try:
     ...:             if all(0 <= indices[j]-indices[i] - 1 <= gap_range for i, j in zip_m):
     ...:                 yield lst
     ...:         except KeyError:
     ...:             pass

Demo:

In [110]: lst = [['she', 'is', 'a', 'student'],
     ...:  ['student', 'she', 'is', 'a', 'lawer'],  # for order check
     ...:  ['she', 'is', 'a', 'great', 'student'],
     ...:  ['i', 'am', 'a', 'teacher'],
     ...:  ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']] 

In [111]: list(find_gappy(lst, ['she', 'is', 'student'], gap_range=2))
Out[111]: [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]

If your sublists contain duplicate words, you can use a defaultdict to keep track of every index of each word and use itertools.product to compare the gaps across all index combinations for each consecutive pair of words.

In [9]: from collections import defaultdict
In [10]: from itertools import product

In [11]: def find_gappy(all_sets, matches, gap_range=2):
    ...:     zip_m = list(zip(matches, matches[1:]))
    ...:     for lst in all_sets:
    ...:         # map each word to every position at which it occurs
    ...:         indices = defaultdict(list)
    ...:         for i, j in enumerate(lst):
    ...:             indices[j].append(i)
    ...:         # k: a position of the earlier word, v: a position of the later word;
    ...:         # a missing word gives an empty list, so any() is False (no KeyError)
    ...:         if all(any(0 <= v - k - 1 <= gap_range for k, v in product(indices[i], indices[j])) for i, j in zip_m):
    ...:             yield lst
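To see what the duplicate-aware idea buys you, here is a minimal sketch written as a plain script. The name find_gappy_dupes and the sample sentences are made up for illustration, and the sketch assumes each index pair is read as (position of the earlier word, position of the later word). The first sentence repeats she and is after student, which would trip up a version that only remembers the last index of each word.

```python
from collections import defaultdict
from itertools import product


def find_gappy_dupes(all_sets, matches, gap_range=2):
    """Yield sublists in which every consecutive pair of `matches` occurs
    in order with at most `gap_range` words in between (hypothetical helper)."""
    pairs = list(zip(matches, matches[1:]))
    for lst in all_sets:
        # map each word to every position at which it occurs
        indices = defaultdict(list)
        for pos, word in enumerate(lst):
            indices[word].append(pos)
        # .get() returns an empty list for a missing word without inserting it
        if all(
            any(0 <= later - earlier - 1 <= gap_range
                for earlier, later in product(indices.get(a, []), indices.get(b, [])))
            for a, b in pairs
        ):
            yield lst


sentences = [
    ['she', 'is', 'a', 'student', 'and', 'she', 'is', 'happy'],
    ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student'],
]
result = list(find_gappy_dupes(sentences, ['she', 'is', 'student'], gap_range=2))
print(result)  # only the first sentence is kept
```

One caveat with this pairwise check: each pair of words is tested independently, so with duplicates two pairs may rely on different occurrences of the shared word. A sublist such as ['is', 'student', 'she', 'is'] therefore passes even though it contains no single in-order run of she, is, student. If that matters for your data, a sequential scan is the safer choice.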

The technique in the linked question is decent enough; you just need to count gaps along the way and, since you don't want a global count, reset the counter whenever you encounter a match. Something like:

import collections

def find_gappy(source, matches, max_gap=-1):
    matches = collections.deque(matches)
    counter = max_gap  # initialize as -1 if you want to begin counting AFTER the first match
    for word in source:
        if word == matches[0]:
            counter = max_gap  # or remove this for global gap counting
            matches.popleft()
            if not matches:
                return True
        else:
            counter -= 1
            if counter == -1:
                return False
    return False

data = [['she', 'is', 'a', 'student'],
        ['she', 'is', 'a', 'lawer'],
        ['she', 'is', 'a', 'great', 'student'],
        ['i', 'am', 'a', 'teacher'],
        ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']]

matches = ['she', 'is', 'student']
x = [i for i in data if find_gappy(i, matches, 2)]
# [['she', 'is', 'a', 'student'], ['she', 'is', 'a', 'great', 'student']]

As a bonus, you can still use it like the original function: gap counting is applied only if you pass a non-negative number as max_gap .
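One thing to watch for: as written, the counter also runs before the first match, so a sentence with several leading words is rejected. The comment in the function hints at the fix (start the counter at -1). Here is a rough sketch of both behaviours side by side; the count_leading flag is a hypothetical addition for this example only, not part of the function above.

```python
import collections


def find_gappy(source, matches, max_gap=-1, count_leading=True):
    # count_leading is a hypothetical flag added for this sketch:
    # False starts the counter at -1, so gaps are only counted
    # after the first match has been seen.
    matches = collections.deque(matches)
    counter = max_gap if count_leading else -1
    for word in source:
        if word == matches[0]:
            counter = max_gap  # reset the gap budget on every match
            matches.popleft()
            if not matches:
                return True
        else:
            counter -= 1
            if counter == -1:  # gap budget exhausted
                return False
    return False


matches = ['she', 'is', 'student']
padded = ['well', 'you', 'know', 'she', 'is', 'a', 'student']
long_gap = ['she', 'is', 'a', 'very', 'very', 'exceptionally', 'good', 'student']

print(find_gappy(padded, matches, 2))                       # False
print(find_gappy(padded, matches, 2, count_leading=False))  # True
print(find_gappy(long_gap, matches))                        # True
print(find_gappy(long_gap, matches, 2))                     # False
```

With the default max_gap of -1 the counter never lands on -1 again after a decrement, which is why the third call behaves like the plain in-order check from the linked question.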

