Fastest way to compare list of lists against list of sets

Is there a faster way to do the following list comprehension?

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

Restrictions

  • listOfSets : None
  • listOfLists : Must maintain the ordering of the sub-sublists during construction, but not necessarily the ordering of the sublists. I.e.:
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] != [[[1, 2, 3], [6, 5, 4]], [[7, 8, 9]]]

but

[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] = [[[7, 8, 9]], [[1, 2, 3], [4, 5, 6]]]
  • ret : ret must maintain the same ordering as the original listOfLists, as detailed above.

My code generates the following list of lists. Each list contains equally sized sublists, but the number of sublists can vary. I.e.:

listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

I need to filter this list of lists to remove every sublist whose elements, taken as a set, appear in a list of sets:

listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}, ...]

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

ret = [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

Note the missing [1, 2, 3] in ret.

I have tried variations of the following:

ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
    ] for lst in listOfLists
]

The idea was that not(set(subList) in listOfSets) would return faster, since it need only find a single match, but to no avail:

%timeit ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]
772 µs ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
   ] for lst in listOfLists
]
797 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
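
The two timings above are within each other's error bars, which is what one would expect if both spellings perform the same membership test. As a minimal sketch (the sample data is hypothetical, shaped like the lists in the question), the check below confirms that they at least produce identical results. In both cases the test against a list has to compare the candidate set with the stored sets one by one until a match is found.

```python
# Hypothetical sample data, shaped like the lists described in the question.
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# Spelling 1: the "not in" operator.
with_not_in = [
    [subList for subList in lst if set(subList) not in listOfSets]
    for lst in listOfLists
]

# Spelling 2: negating an "in" test.
with_negated_in = [
    [subList for subList in lst if not (set(subList) in listOfSets)]
    for lst in listOfLists
]

# Both spellings express the same test and give the same result.
print(with_not_in == with_negated_in)  # True
print(with_not_in)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```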

Original Answer

I will not go into much detail, given that the other answer is already very complete, but I got even better performance by using the difference between sets.

Let:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

For comparison, here is JJ Hassan's second example run on my machine (note that the timed expression uses the not in test from the original question):

>>> filtered_list_of_lists_of_tuples = [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
>>> filtered_list_of_lists_of_tuples
[[(1, 2, 3), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r7 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
930 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now, using the difference between sets:

>>> filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
>>> filtered_list_of_sets_of_tuples
[{(4, 6, 5)}, set()]
>>> %timeit -n1000000 -r7 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
864 ns ± 63.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Try this option as well, as I believe that the difference in speed may become more significant with bigger lists.

The idea behind this is that you have a superlist, which is a list of lists, each containing sublists (in this case, tuples). As per your requirements, the intermediate lists do not need to preserve order (only the superlist and the sublists do), and we want the elements that are not found in set_of_tuples. Consequently, the intermediate lists can be treated as sets, and taking the elements that do not belong to set_of_tuples is simply the difference between sets.
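
One consequence of treating the intermediate lists as sets is worth seeing concretely. The sketch below uses hypothetical data with a repeated tuple to show that the set difference returns sets rather than lists, so duplicates inside an intermediate list collapse and their order is no longer defined. Whether that is acceptable depends on the data.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

# Hypothetical input: (4, 6, 5) appears twice in the first intermediate list.
list_of_lists_of_tuples = [[(4, 6, 5), (1, 2, 3), (4, 6, 5)], [(11, 12, 13)]]

# List comprehension: keeps duplicates and the original order.
by_comprehension = [
    [t for t in l if t not in set_of_tuples] for l in list_of_lists_of_tuples
]

# Set difference: each intermediate list becomes a set, so duplicates collapse.
by_difference = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

print(by_comprehension)  # [[(4, 6, 5), (4, 6, 5)], []]
print(by_difference)     # [{(4, 6, 5)}, set()]
```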

Edit

I just came up with a slightly faster solution using operator and itertools. This new solution, however, is better only when we have enough data.

Let us start with the previous solution:

filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

Now, by a simple application of map, this becomes:

filtered_list_of_sets_of_tuples = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]

Then we can use operator.sub to rewrite this as:

from operator import sub
filtered_list_of_sets_of_tuples = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

or, using the list constructor with a generator expression:

from operator import sub
filtered_list_of_sets_of_tuples = list(sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples))

Finally, we use map once more, this time bringing itertools.repeat into play:

from itertools import repeat
from operator import sub

filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
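
As a quick sanity check, the rewriting steps above can be run side by side on the small sample data to confirm that they compute the same thing. This is only a sketch; the names v1, v2 and v3 are placeholders introduced here for illustration.

```python
from itertools import repeat
from operator import sub

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Step 1: plain list comprehension with the "-" operator.
v1 = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

# Step 2: operator.sub(a, b) is the function form of "a - b".
v2 = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

# Step 3: repeat(set_of_tuples) yields the same set indefinitely; map stops
# as soon as its shortest input, map(set, ...), is exhausted.
v3 = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

print(v1 == v2 == v3)  # True
print(v3)              # [{(4, 6, 5)}, set()]
```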

This new method is actually the slowest of the three on small lists:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
903 ns ± 168 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
789 ns ± 70 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
1.28 µs ± 299 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)

But now let us define bigger lists. I used approximately the sizes you mentioned in a comment:

>>> from random import randint
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
>>> set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}

Using this new data, here are the results I got on my machine:

>>> %timeit -n10000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
65 µs ± 7.05 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
58.1 µs ± 6.67 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
54.1 µs ± 5.34 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
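
The %timeit magic only exists in IPython. For anyone who wants to repeat the comparison as a plain script, here is a rough sketch using the standard timeit module. The candidates dictionary, the fixed random seed and the repeat counts are choices made here for illustration, and the lambda wrappers add a small constant call overhead to every measurement, so absolute numbers will differ from the ones above.

```python
import timeit
from itertools import repeat
from operator import sub
from random import randint, seed

seed(0)  # fixed seed (an arbitrary choice) so runs are comparable
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}

# The three approaches compared above, wrapped as zero-argument callables.
candidates = {
    "comprehension": lambda: [
        [sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples
    ],
    "set difference": lambda: [set(l) - set_of_tuples for l in list_of_lists_of_tuples],
    "map + repeat": lambda: list(
        map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples))
    ),
}

timings = {}
for name, func in candidates.items():
    # Take the best of 5 batches of 200 calls to reduce noise.
    best = min(timeit.repeat(func, number=200, repeat=5))
    timings[name] = best / 200
    print(f"{name:>15}: {timings[name] * 1e6:.1f} µs per call")
```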

Note: the last example here is the closest drop-in speedup for your code; the first two are even faster, but they change the behaviour slightly.

tl;dr: change your list of sets to a set of frozensets for faster membership checks.

You mention that your list of sets can be changed to find an ideal solution, so I'd suggest using a set of something hashable, such as tuples or (in the last example) frozensets. The kind of membership test you're doing is much faster against a set, as in these examples.
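
Applied to the code in the question, the suggestion might look like the sketch below. It assumes listOfSets can be converted once, up front, and the name set_of_frozensets is introduced here for illustration. The not in test from the question is kept, so matching sublists are removed.

```python
# Sample data taken from the question.
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# One-off conversion: frozensets are hashable, so they can live inside a set.
set_of_frozensets = {frozenset(s) for s in listOfSets}

# Same filter as in the question, but each lookup is now a hash lookup
# instead of a scan through the whole list of sets.
ret = [
    [subList for subList in lst if frozenset(subList) not in set_of_frozensets]
    for lst in listOfLists
]

print(ret)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```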

Example 1: Using tuples with casts

In this example we convert the sub-sub-lists to tuples and use a set of tuples. This works because tuples are hashable, so we can put them in a set. Sets give average constant-time membership checks, whereas a list has to be scanned element by element.

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

filtered_list_of_lists_of_lists =  [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]

filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]

# %timeit filtered_list_of_lists_of_lists =  [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]
# 691 ns ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Example 2: Using tuples without casts

If we're allowed to consume a list_of_lists_of_tuples instead, it becomes even faster because we don't have to cast the lists:

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

filtered_list_of_lists_of_tuples =  [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]

filtered_list_of_lists_of_tuples
# >> [[(1, 2, 3), (9, 8, 7)], [(11, 12, 13)]]

# %timeit filtered_list_of_lists_of_tuples =  [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
# >> 474 ns ± 58.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Note that this makes the order within each sub-sub-list/tuple significant, which changes the behaviour of your code slightly. In your code, a sub-sub-list of [1,2,3] would match if the set {3,1,2} was in your list of sets, because set([1,2,3]) == {3,1,2}.
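
The difference can be seen in a few lines. This is a minimal sketch with made-up values: the tuple lookup is order-sensitive, while the frozenset lookup ignores order, as the original set-based test does.

```python
sub_sub_list = [1, 2, 3]

# Same three numbers, stored in a different order.
set_of_tuples = {(3, 1, 2)}
set_of_frozensets = {frozenset((3, 1, 2))}

# Tuples compare element by element, so the order matters.
print(tuple(sub_sub_list) in set_of_tuples)          # False

# Frozensets (like sets) ignore order.
print(frozenset(sub_sub_list) in set_of_frozensets)  # True
print(set(sub_sub_list) == {3, 1, 2})                # True
```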

Example 3: Using frozensets

If we want to preserve that behaviour, we can use frozenset, which is also hashable. We'll use a set of frozensets.

set_of_frozen_sets = {frozenset([1, 2, 3]), frozenset([9,8,7]), frozenset([11,12,13]), frozenset([6,7,9])}

list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

filtered_list_of_lists_of_lists =  [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]

filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]

# %timeit filtered_list_of_lists_of_lists =  [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]
# >> 1.13 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This appears to take a fair bit longer than the first two examples (I suppose converting a list to a frozenset is a bit more expensive than converting it to a tuple).

Please try those and see if you notice the same speed increases on your machine.

Note: one other possible difference in behaviour is that my code will leave an empty sublist in the return value if no sub-sub-lists in that sublist match.
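
If those empty sublists are unwanted, one option is a second pass that drops them. The sketch below uses hypothetical input in which the middle sublist has no match, and the name without_empties is introduced here for illustration. Note that dropping entries means the positions in the result no longer line up with the positions in the input.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

# Hypothetical input: nothing in the middle sublist is in set_of_tuples.
list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5]], [[4, 5, 6]], [[11, 12, 13]]]

# Same filter as Example 1: empty sublists are kept in place.
filtered = [
    [sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists
]
print(filtered)  # [[[1, 2, 3]], [], [[11, 12, 13]]]

# Optional second pass: an empty list is falsy, so "if l" drops it.
without_empties = [l for l in filtered if l]
print(without_empties)  # [[[1, 2, 3]], [[11, 12, 13]]]
```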
