Fastest way to compare list of lists against list of sets

Is there a faster way to accomplish the following list comprehension?

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

Constraints

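  • listOfSets: none
  • listOfLists: the order within the sublists must be preserved during construction, but not necessarily the order of the sublists themselves. I.e.: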
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] != [[[1, 2, 3], [6, 5, 4]], [7, 8, 9]]

[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] = [[[7, 8, 9]], [[1, 2, 3], [4, 5, 6]]]
  • ret: ret must keep the same order as the original listOfLists, as detailed above.

My code generates the following list of lists. Each list contains sublists of the same size, but the number of sublists varies. I.e.:

listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

I need to filter this list of lists, removing every sublist whose set of elements appears in a list of sets:

listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}, ...]

ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]

ret = [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]

Note that [1, 2, 3] is missing from ret.

I have tried the following variation:

ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
    ] for lst in listOfLists
]

The idea was that not(set(subList) in listOfSets) would return sooner, because it only needs to find one match, but to no avail:

%timeit ret = [
    [
        subList 
        for subList in lst 
        if set(subList) not in listOfSets
    ] 
    for lst in listOfLists
]
772 µs ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit ret = [
    [
        subList 
        for subList in lst 
        if not(set(subList) in listOfSets)
   ] for lst in listOfLists
]
797 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
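
A hedged illustration of why the two spellings time the same: x not in seq and not (x in seq) both run the same linear scan over the list and stop at the first equal element, so a sublist with no match is compared against every set. The sketch below counts those comparisons. CountingSet and count_comparisons are hypothetical helpers introduced only for this illustration; they are not part of the original code.

```python
class CountingSet(set):
    """Hypothetical helper: a set that counts equality comparisons."""
    comparisons = 0

    def __eq__(self, other):
        CountingSet.comparisons += 1
        return set.__eq__(self, other)

    __hash__ = None  # plain sets are unhashable too; keep that explicit


def count_comparisons(needle, haystack, use_not_in):
    """Run one negated membership test; return (result, comparisons made)."""
    CountingSet.comparisons = 0
    if use_not_in:
        result = needle not in haystack
    else:
        result = not (needle in haystack)
    return result, CountingSet.comparisons


listOfSets = [CountingSet(s) for s in ({1, 2, 3}, {20, 30, 15}, {6, 7, 8})]

# No match: every set in the list has to be compared.
miss_a = count_comparisons({4, 6, 5}, listOfSets, use_not_in=True)
miss_b = count_comparisons({4, 6, 5}, listOfSets, use_not_in=False)

# Match on the first element: the scan stops early in both spellings.
hit_a = count_comparisons({1, 2, 3}, listOfSets, use_not_in=True)
hit_b = count_comparisons({1, 2, 3}, listOfSets, use_not_in=False)

print(miss_a, miss_b)
print(hit_a, hit_b)
```

Both spellings do the same work per sublist, which suggests the cost lies in scanning the list rather than in how the negation is written.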

Original answer

Since the other answers are already quite complete I won't expand much, but I got better performance by using the difference between sets.

Let:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

For comparison, here is JJ Hassan's second example run on my machine (also, note that in the timing I used the not in from the original question):

>>> filtered_list_of_lists_of_tuples = [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
>>> filtered_list_of_lists_of_tuples
[[(1, 2, 3), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r7 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
930 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Now, using the difference between sets:

>>> filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
>>> filtered_list_of_sets_of_tuples
[{(4, 6, 5)}, set()]
>>> %timeit -n1000000 -r7 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
864 ns ± 63.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

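Try this option as well, since I believe the difference in speed may become more significant with larger lists.

The idea behind this is that you have a superlist, which is a list of lists, each of which contains sublists, or in this case tuples. However, per your requirements, the intermediate lists do not need to preserve order (only the superlist and the sublists do), and we want to take the elements that are not found in set_of_tuples. The intermediate lists can therefore be treated as sets, and the operation of taking the elements that do not belong to set_of_tuples is the difference between sets.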
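
As a minimal sketch of that reasoning, the snippet below checks that the set difference yields the same members as the order-preserving filter on the sample data above. It also shows one caveat that follows from treating the intermediate lists as sets: repeated tuples collapse into one. The names by_filter, by_difference and with_duplicates are illustrative only.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Order-preserving filter: keeps duplicates and the original order.
by_filter = [
    [t for t in inner if t not in set_of_tuples]
    for inner in list_of_lists_of_tuples
]

# Set difference: same members, but each result is an unordered set.
by_difference = [set(inner) - set_of_tuples for inner in list_of_lists_of_tuples]

same_members = all(set(f) == d for f, d in zip(by_filter, by_difference))
print(same_members)

# Caveat: a repeated tuple survives the filter twice but only once in a set.
with_duplicates = [(4, 6, 5), (4, 6, 5), (1, 2, 3)]
filtered = [t for t in with_duplicates if t not in set_of_tuples]
difference = set(with_duplicates) - set_of_tuples
print(len(filtered), len(difference))
```

If duplicates or the order inside the intermediate lists matter for your data, the nested filter is the safer choice.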

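Edit

I just came up with a slightly faster solution using operator and itertools. However, this new solution only pays off when there is enough data.

Let's start from the previous solution: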

filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

Now, with a simple application of map, this becomes:

filtered_list_of_sets_of_tuples = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]

We can then use operator.sub to rewrite it as:

from operator import sub
filtered_list_of_sets_of_tuples = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

Or, using the plain list constructor:

from operator import sub
filtered_list_of_sets_of_tuples = list(sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples))

Finally, we use map again, this time bringing itertools.repeat into play:

from itertools import repeat
from operator import sub

filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
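
To make the chain of rewrites easier to trust, here is a small sketch that builds all four spellings on the sample data and checks that they agree. The variable names are illustrative. Note that repeat(set_of_tuples) is an infinite iterator; this is safe here because map stops as soon as its shortest input is exhausted.

```python
from itertools import repeat
from operator import sub

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Step 0: plain comprehension with the set difference operator.
comprehension = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

# Step 1: build the sets lazily with map.
with_inner_map = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]

# Step 2: spell the difference as a function call.
with_operator_sub = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]

# Step 3: let map drive the whole thing; repeat supplies the second argument.
fully_mapped = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

print(comprehension == with_inner_map == with_operator_sub == fully_mapped)
```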

This new approach is actually the slowest one for the small lists given above:

>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
903 ns ± 168 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
789 ns ± 70 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
1.28 µs ± 299 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)

But now let's define larger lists. I used approximately the sizes you mentioned in the comments:

>>> from random import randint
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
>>> set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}

With this new data, here is what I get on my machine:

>>> %timeit -n10000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
65 µs ± 7.05 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
58.1 µs ± 6.67 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
54.1 µs ± 5.34 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
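
The %timeit magic only exists in IPython. For readers who want to rerun the comparison from a plain script, the following is a rough sketch using the standard timeit module. The fixed seed, the number of repetitions and the function names are assumptions chosen for illustration, and wrapping each expression in a function adds a small call overhead, so the absolute numbers will not match the ones above.

```python
import timeit
from itertools import repeat
from operator import sub
from random import Random

rng = Random(0)  # fixed seed (an assumption) so repeated runs use the same data
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
set_of_tuples = {
    (rng.randint(0, 100), rng.randint(0, 100), rng.randint(0, 100))
    for _ in range(2680)
}


def nested_filter():
    return [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]


def set_difference():
    return [set(l) - set_of_tuples for l in list_of_lists_of_tuples]


def mapped_difference():
    return list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))


for candidate in (nested_filter, set_difference, mapped_difference):
    # Best of 5 repeats of 200 calls each, reported per call in microseconds.
    best = min(timeit.repeat(candidate, number=200, repeat=5)) / 200
    print(f"{candidate.__name__}: {best * 1e6:.1f} µs per call")
```

Taking the minimum over the repeats is a common way to reduce noise from other processes; the mean and standard deviation that %timeit reports are a different summary of the same kind of measurement.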

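Note: the last example here is the closest to your example, but the first two may be useful because they are faster, even though they change the behavior slightly.

tl;dr: change your list of sets to a set of frozensets for faster membership checks.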

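You mentioned that the list of sets can be changed to find the ideal solution, so I suggest using a set of hashable things such as tuples or (in the last example) frozensets. With a set, as used in these examples, the kind of membership test you are performing is much faster.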

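Example 1: using tuples with casting

In this example we cast the sub-sublists to tuples and use a set of tuples. This is better because tuples are hashable, so we can have a set of them. Sets are the best container for membership checks.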

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

filtered_list_of_lists_of_lists =  [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]

filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]

# %timeit filtered_list_of_lists_of_lists =  [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]
# 691 ns ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Example 2: using tuples without casting

If we are allowed to use a list_of_lists_of_tuples instead, it gets even faster, because we don't have to cast the lists:

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}

list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

filtered_list_of_lists_of_tuples =  [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]

filtered_list_of_lists_of_tuples
# >> [[(1, 2, 3), (9, 8, 7)], [(11, 12, 13)]]

# %timeit filtered_list_of_lists_of_tuples =  [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
# >> 474 ns ± 58.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

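This assumes that the order within the sub-sublists/tuples matters, so it changes the behavior of your code slightly. If the set {3,1,2} is in your list of sets, your code treats the sublist [1,2,3] as a match, because set([1,2,3]) == {3,1,2}. With tuples it would not match, because (1,2,3) != (3,1,2).

Example 3: using frozensets

If we want to keep that behavior, we can use frozenset, which is also hashable. We will use a set of frozensets.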
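
A short sketch of that difference, using only built-in types. The single-element containers below are made up for the illustration:

```python
# Sets ignore element order; tuples do not.
print(set([1, 2, 3]) == {3, 1, 2})
print((1, 2, 3) == (3, 1, 2))

sub_sublist = [1, 2, 3]

# Lookup by tuple: order-sensitive, so the reordered entry is not found.
set_of_tuples = {(3, 1, 2)}
tuple_match = tuple(sub_sublist) in set_of_tuples

# Lookup by frozenset: order-insensitive, like the question's set comparison.
set_of_frozen_sets = {frozenset((3, 1, 2))}
frozenset_match = frozenset(sub_sublist) in set_of_frozen_sets

print(tuple_match, frozenset_match)
```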

set_of_frozen_sets = {frozenset([1, 2, 3]), frozenset([9, 8, 7]), frozenset([11, 12, 13]), frozenset([6, 7, 8])}

list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

filtered_list_of_lists_of_lists =  [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]

filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]

# %timeit filtered_list_of_lists_of_lists =  [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]
# >> 1.13 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

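This seems to take a bit longer than the first two examples (I suppose converting from list to frozenset is somewhat expensive).

Please try these and see whether you notice the same speed-up on your machine.

Note: another possible difference in behavior is that my code leaves an empty sublist in the return value if that sublist contains no matching sub-sublists.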
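
For reference, here is a sketch of how the frozenset idea might be applied to the names and the not in condition used in the question. The one-off conversion into setOfFrozensets is an assumption about how you would adapt your data; the sample values are the ones from the question with the elided entries left out.

```python
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

# One-off conversion: frozensets are hashable, so they can live in a set.
setOfFrozensets = {frozenset(s) for s in listOfSets}

# Same filter as in the question, but each membership test is a hash lookup
# instead of a scan over the whole list of sets.
ret = [
    [subList for subList in lst if frozenset(subList) not in setOfFrozensets]
    for lst in listOfLists
]
print(ret)
```

The conversion cost is paid once, so it should matter less the more sublists are checked against the same collection.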
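
If empty sublists are unwanted, one possible follow-up step is to drop them afterwards. This is only a sketch with made-up sample data, and it changes the positions of the remaining entries relative to the input, which may or may not be acceptable for your use case.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (6, 7, 8)}
list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]

# The second sublist has no matching sub-sublist, so it comes back empty.
filtered = [
    [sl for sl in l if tuple(sl) in set_of_tuples]
    for l in list_of_lists_of_lists
]
print(filtered)

# Optional extra pass: keep only the non-empty results.
without_empty = [kept for kept in filtered if kept]
print(without_empty)
```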
