Fastest way to compare list of lists against list of sets
Is there a faster way to perform the following list comprehension?
ret = [
[
subList
for subList in lst
if set(subList) not in listOfSets
]
for lst in listOfLists
]
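Constraints
listOfSets: none.
listOfLists: the order of the elements inside each sublist must be preserved during construction, but the order of the lists themselves need not be. I.e.: [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] != [[[1, 2, 3], [6, 5, 4]], [7, 8, 9]]
but
[[[1, 2, 3], [4, 5, 6]], [[7, 8, 9]]] = [[[7, 8, 9]], [[1, 2, 3], [4, 5, 6]]]
ret: ret must keep the same ordering as the original listOfLists, as detailed above.
My code generates the following list of lists. Each list contains sublists that all have the same size, but the number of sublists varies. I.e.: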
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]
I need to filter this list of lists to remove every sublist that, viewed as a set, is present in the list of sets:
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}, ...]
ret = [
[
subList
for subList in lst
if set(subList) not in listOfSets
]
for lst in listOfLists
]
ret = [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]], ...]
Note that [1, 2, 3] is missing from ret.
I tried the following variant:
ret = [
[
subList
for subList in lst
if not(set(subList) in listOfSets)
] for lst in listOfLists
]
The idea was that not(set(subList) in listOfSets) would return sooner, because it only needs to find one match, but to no avail:
%timeit ret = [
[
subList
for subList in lst
if set(subList) not in listOfSets
]
for lst in listOfLists
]
772 µs ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit ret = [
[
subList
for subList in lst
if not(set(subList) in listOfSets)
] for lst in listOfLists
]
797 µs ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since the other answers are already very complete, I won't expand much, but I got better performance by using the difference between sets.
Let:
>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
For comparison, here is JJ Hassan's second example run on my machine (also note that I use the not in from the original question):
>>> filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
>>> filtered_list_of_lists_of_tuples
[[(4, 6, 5)], []]
>>> %timeit -n1000000 -r7 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
930 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now, using the difference between sets:
>>> filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
>>> filtered_list_of_sets_of_tuples
[{(4, 6, 5)}, set()]
>>> %timeit -n1000000 -r7 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
864 ns ± 63.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
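Try this option as well, because I believe the difference in speed may become more significant with larger lists.
The idea behind it is that you have a superlist, which is a list of lists, each of which contains sublists, or in this case tuples. According to your requirements, however, the intermediate lists do not need to keep their order (only the superlist and the sublists do), and we want to take the elements that are not found in set_of_tuples. Each intermediate list can therefore be treated as a set, and taking the elements that do not belong to set_of_tuples is simply a set difference.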
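A minimal sketch of that idea on the sample data from above. It is meant to show one trade-off that the explanation implies: the set difference yields the same surviving tuples as the order-preserving comprehension, but each intermediate container becomes a set, so its ordering and any duplicate tuples are lost. The names `by_comprehension`, `by_difference` and `same_members` are illustrative only.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Order-preserving filter: keeps duplicates and the original ordering.
by_comprehension = [
    [sl for sl in l if sl not in set_of_tuples]
    for l in list_of_lists_of_tuples
]

# Set difference: same members, but each inner container is now a set.
by_difference = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

# The two agree once ordering and duplicates are ignored.
same_members = [set(a) == b for a, b in zip(by_comprehension, by_difference)]
print(same_members)
```

If the intermediate lists can contain repeated tuples that must be kept, the comprehension is probably the safer choice.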
I just came up with a slightly faster solution using operator and itertools. However, this new solution only does better when there is enough data.
Let's start from the previous solution:
filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
Now, with a simple application of map, this becomes:
filtered_list_of_sets_of_tuples = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]
We can then use operator.sub to rewrite it as:
from operator import sub
filtered_list_of_sets_of_tuples = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]
Or, using a plain list call instead of a list comprehension:
from operator import sub
filtered_list_of_sets_of_tuples = list(sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples))
Finally, we use map again, this time bringing itertools.repeat into play:
from itertools import repeat
from operator import sub
filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
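As a quick sanity check, the rewriting steps above can be run side by side to confirm they all produce the same result. This is only a sketch on the small sample data; the names `v1` to `v4` and `all_equal` are illustrative. Note that `repeat(set_of_tuples)` is an endless iterator, which is safe here because `map` stops as soon as its shortest input is exhausted.

```python
from itertools import repeat
from operator import sub

set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]

# Step 0: the original set-difference comprehension.
v1 = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
# Step 1: build the sets lazily with map.
v2 = [s - set_of_tuples for s in map(set, list_of_lists_of_tuples)]
# Step 2: replace the - operator with operator.sub.
v3 = [sub(s, set_of_tuples) for s in map(set, list_of_lists_of_tuples)]
# Step 3: let map drive the whole thing, feeding set_of_tuples via repeat.
v4 = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

all_equal = v1 == v2 == v3 == v4
print(all_equal)
```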
This new approach is actually the slowest one for small lists:
>>> set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
>>> %timeit -n1000000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
903 ns ± 168 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
789 ns ± 70 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
>>> %timeit -n1000000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
1.28 µs ± 299 ns per loop (mean ± std. dev. of 20 runs, 1000000 loops each)
But now let's define larger lists. I used roughly the sizes you mentioned in the comments:
>>> from random import randint
>>> list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
>>> set_of_tuples = {(randint(0, 100), randint(0, 100), randint(0, 100)) for _ in range(2680)}
With this new data, this is what I get on my machine:
>>> %timeit -n10000 -r20 filtered_list_of_lists_of_tuples = [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]
65 µs ± 7.05 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = [set(l) - set_of_tuples for l in list_of_lists_of_tuples]
58.1 µs ± 6.67 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
>>> %timeit -n10000 -r20 filtered_list_of_sets_of_tuples = list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))
54.1 µs ± 5.34 µs per loop (mean ± std. dev. of 20 runs, 10000 loops each)
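The timings above rely on IPython's %timeit. For anyone who wants to repeat the comparison in a plain Python script, here is a rough sketch using the standard timeit module. The function names, the fixed random seed and the loop counts are assumptions made for this sketch, and the absolute numbers will differ from machine to machine.

```python
import timeit
from itertools import repeat
from operator import sub
from random import Random

rng = Random(0)  # fixed seed (an assumption) so the data is the same on every run
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]] * 100
set_of_tuples = {
    (rng.randint(0, 100), rng.randint(0, 100), rng.randint(0, 100))
    for _ in range(2680)
}

def with_comprehension():
    return [[sl for sl in l if sl not in set_of_tuples] for l in list_of_lists_of_tuples]

def with_difference():
    return [set(l) - set_of_tuples for l in list_of_lists_of_tuples]

def with_map():
    return list(map(sub, map(set, list_of_lists_of_tuples), repeat(set_of_tuples)))

calls = 200  # calls per timing run
for fn in (with_comprehension, with_difference, with_map):
    # Take the best of 5 runs to reduce noise from other processes.
    best = min(timeit.repeat(fn, number=calls, repeat=5))
    print(f"{fn.__name__}: {best / calls * 1e6:.1f} us per call")
```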
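Note: the last example here is the closest match to your example while still giving a speed-up, but the first two may be useful because they are faster, although they change the behavior slightly.
tl;dr: change your list of sets into a set of frozensets for faster membership checks.
You mentioned that your list of sets can be changed in order to find an ideal solution, so I suggest using a set of hashable things such as tuple or (in the last example) frozenset. The kind of membership test you are doing is much faster against a set, which is what I use in these examples.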
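Applied to the data and the `not in` filter from the question, the tl;dr could look roughly like this. It is a sketch, not a drop-in replacement: the name `setOfFrozensets` is made up here, and the one-time conversion only pays off if listOfSets is reused for many lookups.

```python
listOfLists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
listOfSets = [{1, 2, 3}, {20, 30, 15}, {6, 7, 8}]

# One-time conversion: frozensets are hashable, so they can be stored in a set,
# which turns each membership test into a hash lookup instead of a linear scan.
setOfFrozensets = {frozenset(s) for s in listOfSets}

ret = [
    [subList for subList in lst if frozenset(subList) not in setOfFrozensets]
    for lst in listOfLists
]
print(ret)  # [[[4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
```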
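Example 1: using tuples with casting
In this example we cast the sub-sublists to tuples and use a set of tuples. This is better because tuples are hashable, so we can have a set of them. Sets are the best container for membership checks.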
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
filtered_list_of_lists_of_lists = [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]
filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]
# %timeit filtered_list_of_lists_of_lists = [[sl for sl in l if tuple(sl) in set_of_tuples] for l in list_of_lists_of_lists]
# 691 ns ± 91.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Example 2: using tuples without casting
If we are allowed to use a list_of_lists_of_tuples instead, it gets faster still, because we don't have to cast the lists
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(11, 12, 13)]]
filtered_list_of_lists_of_tuples = [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
filtered_list_of_lists_of_tuples
# >> [[(1, 2, 3), (9, 8, 7)], [(11, 12, 13)]]
# %timeit filtered_list_of_lists_of_tuples = [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]
# >> 474 ns ± 58.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
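This assumes that the order within the sub-sublists/tuples matters, which changes the behavior of the code slightly. If the set {3,1,2} is in your list of sets, your code would match a sublist [1,2,3], because set([1,2,3]) == {3,1,2}, whereas the tuple (1, 2, 3) is not equal to (3, 1, 2).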
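The difference can be seen in a tiny sketch. The variable names are illustrative, and the data is reduced to a single entry to isolate the effect.

```python
sub_list = [1, 2, 3]
list_of_sets = [{3, 1, 2}]    # what the question's code searches
set_of_tuples = {(3, 1, 2)}   # what Examples 1 and 2 search

# Sets ignore element order, so the question's test finds a match...
matches_as_set = set(sub_list) in list_of_sets
# ...but tuples compare position by position, so this one does not.
matches_as_tuple = tuple(sub_list) in set_of_tuples

print(matches_as_set, matches_as_tuple)  # True False
```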
Example 3: using frozensets
If we want to keep that behavior, we can use frozenset, which is also hashable. We will use a set of frozensets.
set_of_frozen_sets = {frozenset([1, 2, 3]), frozenset([9,8,7]), frozenset([11,12,13]), frozenset([6,7,9])}
list_of_lists_of_lists = [[[1, 2, 3], [4, 6, 5], [9, 8, 7]], [[11, 12, 13]]]
filtered_list_of_lists_of_lists = [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]
filtered_list_of_lists_of_lists
# >> [[[1, 2, 3], [9, 8, 7]], [[11, 12, 13]]]
# %timeit filtered_list_of_lists_of_lists = [[sl for sl in l if frozenset(sl) in set_of_frozen_sets] for l in list_of_lists_of_lists]
# >> 1.13 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
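This seems to take a bit longer than the first two examples (I guess converting from list to frozenset is somewhat expensive).
Please try these and see whether you notice the same speed-ups on your machine.
Note: another possible difference in behavior is that my code leaves an empty sublist in the return value if that sublist contains no matching sub-sublists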
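If those empty sublists are unwanted, one possible follow-up step is to drop them afterwards. This is only a sketch: the second input list is changed to a hypothetical non-matching tuple so that an empty sublist actually appears, and the names `filtered` and `without_empties` are made up for illustration.

```python
set_of_tuples = {(1, 2, 3), (9, 8, 7), (11, 12, 13), (6, 7, 8)}
# (20, 21, 22) is a made-up tuple that is not in set_of_tuples.
list_of_lists_of_tuples = [[(1, 2, 3), (4, 6, 5), (9, 8, 7)], [(20, 21, 22)]]

# Same filter as in Example 2; the second sublist has no match and ends up empty.
filtered = [[sl for sl in l if sl in set_of_tuples] for l in list_of_lists_of_tuples]

# An empty list is falsy, so this keeps only the non-empty sublists.
without_empties = [l for l in filtered if l]
print(filtered, without_empties)
```

Dropping sublists shifts the positions of the remaining ones, so this only makes sense if the caller does not rely on ret lining up index by index with the input.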