
Fastest way to compare list of lists against list of sets

Is there a faster way to do the following list comprehension?

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

Restrictions

  • listOfSets : None
  • listOfLists : Must maintain the ordering of the subsublists during construction, but not necessarily the ordering of the sublists, i.e.:
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] != [[[1, 2, 3], [6, 5, 4]], [[7, 8, 9]]]

but

[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] = [[[7, 8, 9]], [[1, 2, 3], [4, 5, 6]]]
  • ret : ret must maintain the same ordering of the original listOfLists as detailed above.

My code generates the following list of lists. Each list contains equally sized sublists, but the number of sublists can vary, i.e.:

listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

I need to filter this list of lists to remove all sublists that exist (as sets) in a list of sets:

listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}, ...]

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

ret = [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

Note the missing [1, 2, 3] in ret.
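
For reference, here is a minimal runnable version of the snippet above on the sample data. It assumes the elided `...` entries are simply dropped, so the lists are shorter than the real ones.

```python
# Sample data from the question, with the elided "..." entries dropped.
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# Keep only the sublists whose elements do not form one of the sets.
ret = [
    [subList for subList in lst if set(subList) not in listOfSets]
    for lst in listOfLists
]

print(ret)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```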

I have tried variations of the following

ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
    ] for lst in listOfLists
]

with the idea being that not(set(subList) in listOfSets) will return faster since it need only find a single match, but to no avail:

%timeit ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]
772 µs ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
   ] for lst in listOfLists
]
797 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
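
The two spellings express the same membership test, so similar timings are to be expected. In both cases the scan of listOfSets stops at the first match and has to visit every set when there is none. A quick check on the sample data (again with the `...` entries dropped) shows that they select the same sublists:

```python
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# Variant 1: the "not in" operator.
with_not_in = [
    [s for s in lst if set(s) not in listOfSets]
    for lst in listOfLists
]

# Variant 2: negating the result of "in".
with_negated_in = [
    [s for s in lst if not (set(s) in listOfSets)]
    for lst in listOfLists
]

print(with_not_in == with_negated_in)  # True
```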

Original Answer

I will not go into much detail, given that the other answer is already very complete, but I got even better performance by using the difference between sets.

Let:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

For comparison, here is JJ Hassan's second example run on my machine (note that I restored the not in that was in the original question):

>>> filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
>>> filtered_list_of_lists_of_tuples
[[(4, 6, 5)], []]
>>> %timeit -n1000000 -r7 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
930 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now, using the difference between sets:

>>> filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
>>> filtered_list_of_sets_of_tuples
[{(4, 6, 5)}, set()]
>>> %timeit -n1000000 -r7 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
864 ns ± 63.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Try this option as well, as I believe that the difference in speed may become more significant with bigger lists.

The idea behind this is that you have a superlist, which is a list of lists, each containing sublists, or in this case tuples. But, as per your requirements, the intermediate lists do not need to preserve order (only the superlist and the sublists do), and we want to take those elements that are not found in set_of_tuples. Consequently, the intermediate lists can be seen as sets, and the operation of taking the elements that do not belong to set_of_tuples is trivially the difference between sets.
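
As a sketch of how this could apply to the lists of lists from the question: lists are not hashable, so each sublist has to be converted to a tuple first. This assumes that duplicates inside an intermediate list do not matter, since a set keeps a single copy and no order.

```python
list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

# Convert every sublist to a tuple, build a set per intermediate list,
# then remove everything that also appears in set_of_tuples.
filtered = [
    set(map(tuple, lst)) - set_of_tuples
    for lst in list_of_lists_of_lists
]

print(filtered)  # [{(4, 6, 5)}, set()]
```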

Edit

I just came up with a slightly faster solution by using operator and itertools. This new solution, however, is better only when we have enough data.

Let us start with the previous solution:

filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

Now, by a simple application of map, this becomes:

filtered_list_of_sets_of_tuples = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]

Then we can use operator.sub to rewrite this as:

from operator import sub
filtered_list_of_sets_of_tuples = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

or, using list() on a generator expression:

from operator import sub
filtered_list_of_sets_of_tuples = list(sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples))

Finally, we use map once more, this time bringing itertools.repeat to the game:

from itertools import repeat
from operator import sub

filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
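
The rewriting steps above are meant to be behaviour-preserving. A small self-contained check on the sample data from this answer (the variable names for the four variants are only illustrative):

```python
from itertools import repeat
from operator import sub

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Step 0: plain comprehension with the "-" operator.
comprehension = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
# Step 1: build the sets with map.
with_inner_map = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]
# Step 2: replace "-" with operator.sub.
with_sub = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]
# Step 3: map over both arguments; repeat() supplies set_of_tuples each time,
# and map stops when the shorter (finite) iterable is exhausted.
with_repeat = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

print(comprehension == with_inner_map == with_sub == with_repeat)  # True
```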

This new method is actually the slowest given small lists:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
903 ns ± 168 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
789 ns ± 70 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
1.28 µs ± 299 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)

But now let us define bigger lists. I used approximately the sizes you mentioned in a comment:

>>> from random import randint
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
>>> set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}

Using this new data, here are the results I got on my machine:

>>> %timeit -n10000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
65 µs ± 7.05 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
58.1 µs ± 6.67 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
54.1 µs ± 5.34 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
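
To repeat this comparison outside IPython, something like the following sketch with the standard timeit module could be used. The function names, the fixed seed, and the number/repeat values are assumptions made for the example; absolute timings will differ between machines.

```python
import timeit
from itertools import repeat
from operator import sub
from random import randint, seed

seed(0)  # assumption: fixed seed so that repeated runs use the same data
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}

def comprehension():
    return [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]

def difference():
    return [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

def mapped():
    return list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

number = 200  # calls per measurement
for func in (comprehension, difference, mapped):
    # Take the best of three measurements to reduce noise.
    best = min(timeit.repeat(func, number=number, repeat=3))
    print(f"{func.__name__}: {best / number * 1e6:.1f} us per loop")
```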

Note: the last example here is the closest drop-in replacement for your code, but the first two might be useful as they are even faster, although they change the behaviour slightly. tl;dr: change your list of sets to a set of frozensets for faster membership checks.

You mention that your list of sets can be changed to find an ideal solution, so I'd suggest using a set of something hashable like tuples, or (in the last example) frozensets. A membership test against a list compares the value with each element in turn, while a set uses a hash lookup, so the type of membership test you're doing is much faster when using sets like I do in these examples.
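
A minimal sketch of that conversion, assuming the listOfSets from the question with the `...` entries dropped (set_of_frozensets is just an illustrative name):

```python
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# Plain sets are not hashable, so convert each one to a frozenset
# before collecting them in an outer set.
set_of_frozensets = set(map(frozenset, listOfSets))

print(frozenset([3, 1, 2]) in set_of_frozensets)  # True
print(frozenset([4, 6, 5]) in set_of_frozensets)  # False
```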

Example 1: Using tuples with casts

In this example we convert the sub-sub-lists to tuples, and use a set of tuples. This is better because tuples are hashable and we can have a set of them. Sets are the best container for membership checks.

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

filtered_list_of_lists_of_lists =  [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]

filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]

# %timeit filtered_list_of_lists_of_lists =  [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]
# 691 ns ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Example 2: Using tuples without casts

If we're allowed to consume a list_of_lists_of_tuples instead, it becomes even faster because we don't have to cast the lists.

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

filtered_list_of_lists_of_tuples =  [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]

filtered_list_of_lists_of_tuples
# >> [[(1, 2, 3), (9, 8, 7)], [(11, 12, 13)]]

# %timeit filtered_list_of_lists_of_tuples =  [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
# >> 474 ns ± 58.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This assumes that the order of the sub-sub-list/tuple is important though, as it changes the behaviour of your code slightly. In your code a sub-sub-list of [1,2,3] would be matched if the set {3,1,2} was in your list of sets because set([1,2,3]) == {3,1,2}.
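
The difference can be seen directly by comparing the same elements as sets and as tuples:

```python
# Sets ignore element order, so these compare equal.
sets_equal = set([1, 2, 3]) == {3, 1, 2}

# Tuples are ordered, so a reordered tuple is a different key.
tuple_found = (1, 2, 3) in {(3, 1, 2)}

print(sets_equal)   # True
print(tuple_found)  # False
```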

Example 3: Using frozensets

If we wanted to preserve that behaviour then we can use frozenset which is also hashable. We'll use a set of frozensets.

set_of_frozen_sets = {frozenset([1, 2, 3]), frozenset([9, 8, 7]), frozenset([11, 12, 13]), frozenset([6, 7, 8])}

list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

filtered_list_of_lists_of_lists =  [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]

filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]

# %timeit filtered_list_of_lists_of_lists =  [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]
# >> 1.13 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This appears to take a fair bit longer than the first two examples (I suppose the conversion from list to frozenset is a bit more expensive).
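
Applied to the data and the `not in` condition from the original question, the frozenset approach might look like the sketch below. It assumes the `...` entries are dropped and that listOfSets can be converted once, up front.

```python
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# Build the hashed lookup once, outside the comprehension.
set_of_frozensets = {frozenset(s) for s in listOfSets}

# Same filter as in the question, but with a set lookup
# instead of a linear scan over listOfSets.
ret = [
    [subList for subList in lst if frozenset(subList) not in set_of_frozensets]
    for lst in listOfLists
]

print(ret)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```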

Please try those and see if you notice the same speed increases on your machine.

Note: One other possible difference in behaviour is that my code will leave an empty sublist in the return value if no sub-sub-lists in that sublist match.
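
If those empty sublists are unwanted, one possible follow-up step is to filter them out afterwards. The sample data here is made up for illustration so that the second sublist has no match:

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
# Hypothetical input: nothing in the second sublist is in set_of_tuples.
list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[20, 21, 22]]]

filtered = [
    [sl for sl in l if tuple(sl) in set_of_tuples]
    for l in list_of_lists_of_lists
]
print(filtered)  # [[[1, 2, 3], [9, 8, 7]], []]

# Drop the sublists that ended up empty.
non_empty = [l for l in filtered if l]
print(non_empty)  # [[[1, 2, 3], [9, 8, 7]]]
```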
