Fastest way to uniquify a list of dicts with a rule?
I have a list of dicts:

list1 = [
    { 'T': 1234, 'V': 10, 'O': 1 },
    { 'T': 2345, 'V': 50, 'O': 5 },
    { 'T': 2345, 'V': 30, 'O': 3 },
    { 'T': 3456, 'V': 40, 'O': 91 },
]
I need to uniquify these: T should be unique, and whichever dict has the larger V should win. That should produce:

[
    {'T': 1234, 'V': 10, 'O': 1},
    {'T': 2345, 'V': 50, 'O': 5},
    {'T': 3456, 'V': 40, 'O': 91}
]
I came up with this:

interm = {o['T']: o for o in list1}
for o in list1:
    if o['V'] > interm[o['T']]['V']:
        interm[o['T']] = o

However, I effectively iterate over the list twice and set dict values multiple times. This feels like it could be improved, but I don't know how.

Given these constraints, is there a faster way to achieve this?
Assuming list1 is already sorted by T, you can use itertools.groupby:
from itertools import groupby

li = [
    { 'T': 1234, 'V': 10, 'O': 1 },
    { 'T': 2345, 'V': 50, 'O': 5 },
    { 'T': 2345, 'V': 30, 'O': 3 },
    { 'T': 3456, 'V': 40, 'O': 91 },
]

output = [max(group, key=lambda d: d['V'])
          for _, group in groupby(li, key=lambda d: d['T'])]

print(output)
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]
If it is not, groupby can still be combined with sort for an O(n log n) solution:
order_by_t = lambda d: d['T']

li.sort(key=order_by_t)
output = [max(group, key=lambda d: d['V'])
          for _, group in groupby(li, key=order_by_t)]
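As a quick illustration of why the pre-sort matters (a small demo beyond the original answer): groupby only merges *consecutive* items with equal keys, so on unsorted input duplicate T values in separate runs are not grouped together:

```python
from itertools import groupby

# Unsorted input: the two T=2345 entries are not adjacent.
unsorted_li = [
    {'T': 2345, 'V': 50},
    {'T': 1234, 'V': 10},
    {'T': 2345, 'V': 30},
]

# groupby only merges consecutive runs, so T=2345 shows up twice here.
keys_unsorted = [k for k, _ in groupby(unsorted_li, key=lambda d: d['T'])]
print(keys_unsorted)  # [2345, 1234, 2345]

# After sorting by T, each key forms a single run.
keys_sorted = [k for k, _ in groupby(sorted(unsorted_li, key=lambda d: d['T']),
                                     key=lambda d: d['T'])]
print(keys_sorted)  # [1234, 2345]
```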
Here is a step-by-step approach. It iterates over your list once and builds a new list:
list1 = [
    { 'T': 1234, 'V': 10, 'O': 1 },
    { 'T': 2345, 'V': 50, 'O': 5 },
    { 'T': 2345, 'V': 30, 'O': 3 },
    { 'T': 3456, 'V': 40, 'O': 91 },
]

# add this step if not already sorted by T
# list1 = sorted(list1, key=lambda x: x["T"])

list2 = []
for e in list1:
    t, v, o = e["T"], e["V"], e["O"]
    # we already stored something with the same T
    if list2 and list2[-1]["T"] == t:
        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o
    # did not store anything yet, or a different T
    else:
        list2.append(e)

print(list2)
Output:
[{'T': 1234, 'O': 1, 'V': 10},
{'T': 2345, 'O': 5, 'V': 50},
{'T': 3456, 'O': 91, 'V': 40}]
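One caveat worth noting (not in the original answer): list2.append(e) stores a reference to the dict inside list1, so the later in-place updates of "V" and "O" also mutate the corresponding dict in the input list. If list1 must stay untouched, appending a shallow copy avoids this; a minimal sketch:

```python
list1 = [
    {'T': 2345, 'V': 30, 'O': 3},
    {'T': 2345, 'V': 50, 'O': 5},
]

list2 = []
for e in list1:
    if list2 and list2[-1]["T"] == e["T"]:
        if list2[-1]["V"] < e["V"]:
            list2[-1]["V"] = e["V"]
            list2[-1]["O"] = e["O"]
    else:
        list2.append(dict(e))  # shallow copy keeps list1 intact

print(list2)     # [{'T': 2345, 'V': 50, 'O': 5}]
print(list1[0])  # still {'T': 2345, 'V': 30, 'O': 3} -- input not mutated
```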
Assuming your list is already sorted by T, you can simply keep track of the largest V seen for each key in a single pass, replacing the stored maximum whenever a larger one is found:
list1 = [
    { 'T': 1234, 'V': 10, 'O': 1 },
    { 'T': 2345, 'V': 50, 'O': 5 },
    { 'T': 2345, 'V': 30, 'O': 3 },
    { 'T': 3456, 'V': 40, 'O': 91 },
]

unique = {}
for dic in list1:
    key = dic['T']
    found = unique.get(key)
    # If a value was found and this one doesn't exceed the current maximum, just ignore it
    if found and dic['V'] <= found['V']:
        continue
    # otherwise just update normally
    unique[key] = dic

print(list(unique.values()))
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]
If your list is not guaranteed to be sorted by T, you can sort it beforehand, using T as the sort key:

from operator import itemgetter

list1 = sorted(list1, key=itemgetter('T'))

Using operator.itemgetter above is equivalent to:

list1 = sorted(list1, key=lambda x: x['T'])
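Worth noting (an observation beyond the original answer): because unique is keyed by T and keeps the running maximum per key, the single pass gives the correct per-T maxima even on unsorted input; sorting only matters if you need the output ordered by T. A quick sketch:

```python
li = [
    {'T': 2345, 'V': 30, 'O': 3},
    {'T': 1234, 'V': 10, 'O': 1},
    {'T': 2345, 'V': 50, 'O': 5},
]

unique = {}
for dic in li:
    found = unique.get(dic['T'])
    if found and dic['V'] <= found['V']:
        continue
    unique[dic['T']] = dic

# Per-T maxima are correct; output order is first-occurrence order, not sorted by T.
print(list(unique.values()))
# [{'T': 2345, 'V': 50, 'O': 5}, {'T': 1234, 'V': 10, 'O': 1}]
```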
Since the question asks for the *fastest* way, I timed the current approaches with the given data. RoadRunner's solution runs fastest on this dataset, mine comes second, and DeepSpace's third:
>>> import timeit
>>> timeit.timeit(p1,setup=up) # https://stackoverflow.com/a/54957067/7505395
2.5858893489556913
>>> timeit.timeit(p2,setup=up) # https://stackoverflow.com/a/54957090/7505395
0.8051884429499854
>>> timeit.timeit(p3,setup=up) # https://stackoverflow.com/a/54957156/7505395
0.7680418536661247
Test code:
up = """from itertools import groupby
li = [
{ 'T': 1234, 'V': 10, 'O': 1 },
{ 'T': 2345, 'V': 50, 'O': 5 },
{ 'T': 2345, 'V': 30, 'O': 3 },
{ 'T': 3456, 'V': 40, 'O': 91 },
]"""
Source: https://stackoverflow.com/a/54957067/7505395
p1 = """
# li.sort(key=lambda x:x["T"]) # for the random data
output = [max(group, key=lambda d: d['V'])
for _, group in groupby(li, key=lambda d: d['T'])]
"""
Source: https://stackoverflow.com/a/54957090/7505395
p2 = """
# li.sort(key=lambda x:x["T"]) # for the random data
list2 = []
for e in li:
    t, v, o = e["T"], e["V"], e["O"]
    # we already stored something with the same T
    if list2 and list2[-1]["T"] == t:
        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o
    # did not store anything yet, or a different T
    else:
        list2.append(e)
"""
Source: https://stackoverflow.com/a/54957156/7505395
p3 = """
unique = {}
for dic in li:
    key = dic['T']
    found = unique.get(key)
    # If a value was found and this one doesn't exceed the current maximum, just ignore it
    if found and dic['V'] <= found['V']:
        continue
    # otherwise just update normally
    unique[key] = dic
"""
Edit (random 10k data, sorted and unsorted) to see whether the results are data dependent:

Random data: 10000 data points, T in [1,100], V in [10,20,...,200], "O" in [1,1000000]
up = """
from itertools import groupby
import random

random.seed(42)

def r():
    # few distinct T values so we get plenty of dupes
    return {"T": random.randint(1, 100), "V": random.randint(1, 20) * 10,
            "O": random.randint(1, 1000000)}

li = [r() for _ in range(10000)]
# li.sort(key=lambda x:x["T"]) # uncommented for the pre-sorted run
"""
Source: https://stackoverflow.com/a/54957067/7505395
p1 = """
li.sort(key=lambda x:x["T"]) # needs sorting, commented for pre-sorted run
output = [max(group, key=lambda d: d['V'])
for _, group in groupby(li, key=lambda d: d['T'])]
"""
Source: https://stackoverflow.com/a/54957090/7505395
p2 = """
li.sort(key=lambda x:x["T"]) # needs sorting; commented out for the pre-sorted run
list2 = []
for e in li:
    t, v, o = e["T"], e["V"], e["O"]
    # we already stored something with the same T
    if list2 and list2[-1]["T"] == t:
        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o
    # did not store anything yet, or a different T
    else:
        list2.append(e)
"""
Source: https://stackoverflow.com/a/54957156/7505395
p3 = """
unique = {}
for dic in li:
    key = dic['T']
    found = unique.get(key)
    # If a value was found and this one doesn't exceed the current maximum, just ignore it
    if found and dic['V'] <= found['V']:
        continue
    # otherwise just update normally
    unique[key] = dic
"""
Source: https://stackoverflow.com/a/54957363/7505395
p4 = """
t_v = {}
result = []
for row in li:
    if not t_v.get(row['T']):
        t_v[row['T']] = (row['V'], len(result))
        result.append(row)
        continue
    if row['V'] > t_v[row['T']][0]:
        t_v[row['T']] = (row['V'], t_v[row['T']][1])
        result[t_v[row['T']][1]] = row
"""
Results with the sort done inside p1/p2:

import timeit

timeit.timeit(p1, setup=up, number=100)  # 0.4958197257468498   4th
timeit.timeit(p2, setup=up, number=100)  # 0.4506078658396253   3rd
timeit.timeit(p3, setup=up, number=100)  # 0.24399979946368378  1st
timeit.timeit(p4, setup=up, number=100)  # 0.2561938286132954   2nd
Results for pre-sorted data:

timeit.timeit(p1, setup=up, number=100)  # 0.3046940103986765   3rd
timeit.timeit(p2, setup=up, number=100)  # 0.33943337437485366  4th
timeit.timeit(p3, setup=up, number=100)  # 0.2795306502784811   1st
timeit.timeit(p4, setup=up, number=100)  # 0.29027710723995326  2nd
For this, in a single loop over the unsorted table, I build a lookup table that stores information about the current result list. The lookup table uses 'T' as the key, storing the 'V' value together with the item's index in the result list.

While looping over the data, check each row's 'T' value against the lookup-table keys.

If the key does not exist, add it.

If it does exist, compare its stored value with the row's 'V' value.

If the current row's 'V' is larger, use the stored index to replace the previous row in the result list.
arr = [
    {'T': 2345, 'V': 50, 'O': 5},
    {'T': 1234, 'V': 10, 'O': 1},
    {'T': 2345, 'V': 30, 'O': 3},
    {'T': 3456, 'V': 40, 'O': 91},
]

def filter_out_lowest_values(arr):
    lookup = {}
    result = []
    for row in arr:
        row_key, row_value = row['T'], row['V']
        if not lookup.get(row_key):
            lookup[row_key] = (row_value, len(result))
            result.append(row)
            continue
        lookup_value, result_index = lookup[row_key][0], lookup[row_key][1]
        if row_value > lookup_value:
            lookup[row_key] = (row_value, result_index)
            result[result_index] = row
    return result

print(filter_out_lowest_values(arr))
Result (output keeps first-occurrence order, since the input is unsorted):

> [{'T': 2345, 'V': 50, 'O': 5}, {'T': 1234, 'V': 10, 'O': 1}, {'T': 3456, 'V': 40, 'O': 91}]
To answer the question of the fastest way to uniquify the list, see the benchmarks below.

The result depends heavily on the data provided: the length of the list, whether it is sorted, and the number of unique keys all play a part.

From my benchmarks, Patrick Artner's solution is the fastest on a sorted list. My own is the fastest on an unsorted list once the lookup table is fully populated.

For each value of n, every script was run 100 times and the fastest (minimum) run time was plotted.
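The benchmark harness is not reproduced here, but a minimal sketch of the "best of 100 runs" measurement (function and data names are illustrative, not from the linked script) could look like this:

```python
import random
import timeit

def make_data(n):
    # Illustrative data generator: few distinct T values to create duplicates.
    return [{'T': random.randint(1, 100), 'V': random.randint(1, 20) * 10,
             'O': random.randint(1, 1_000_000)} for _ in range(n)]

def dedupe(li):
    # The dict-per-T single-pass approach, used here as the timed subject.
    unique = {}
    for d in li:
        found = unique.get(d['T'])
        if found is None or d['V'] > found['V']:
            unique[d['T']] = d
    return list(unique.values())

data = make_data(1000)
# repeat() returns one elapsed time per run; taking min() gives the fastest
# run, which is what the tables below report.
times = timeit.repeat(lambda: dedupe(data), number=1, repeat=100)
print(f"min={min(times):.9f}  avg={sum(times) / len(times):.9f}  max={max(times):.9f}")
```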
Unsorted Benchmarks
N = 10
------
| min | avg | max | func | name |
|---------------|---------------|---------------|----------------------------|------------------|
| 0.000006437 | 0.000007293 | 0.000022173 | sarcoma | sarcoma |
| 0.000007153 | 0.000007646 | 0.000017881 | road_runner_with_sort | RoadRunner |
| 0.000007868 | 0.000008337 | 0.000013351 | patrick_artner_with_sort | Patrick_Artner |
| 0.000015497 | 0.000017719 | 0.000026703 | deep_space_with_sort | DeepSpace |
N = 100
------
| min | avg | max | func | name |
|---------------|---------------|---------------|----------------------------|------------------|
| 0.000043154 | 0.000045519 | 0.000057936 | road_runner_with_sort | RoadRunner |
| 0.000053883 | 0.000056396 | 0.000069141 | sarcoma | sarcoma |
| 0.000055075 | 0.000057223 | 0.000063181 | patrick_artner_with_sort | Patrick_Artner |
| 0.000135660 | 0.000145028 | 0.000174046 | deep_space_with_sort | DeepSpace |
N = 1000
------
| min | avg | max | func | name |
|---------------|---------------|---------------|----------------------------|------------------|
| 0.000294447 | 0.000559096 | 0.000992775 | road_runner_with_sort | RoadRunner |
| 0.000327826 | 0.000374844 | 0.000650883 | patrick_artner_with_sort | Patrick_Artner |
| 0.000344276 | 0.000605364 | 0.002207994 | sarcoma | sarcoma |
| 0.000758171 | 0.001031160 | 0.002290487 | deep_space_with_sort | DeepSpace |
N = 10000
------
| min | avg | max | func | name |
|---------------|---------------|---------------|----------------------------|------------------|
| 0.003607988 | 0.003875387 | 0.005285978 | road_runner_with_sort | RoadRunner |
| 0.003780127 | 0.004181504 | 0.005370378 | sarcoma | sarcoma |
| 0.003986597 | 0.004258037 | 0.006756544 | patrick_artner_with_sort | Patrick_Artner |
| 0.007097244 | 0.007444410 | 0.009983778 | deep_space_with_sort | DeepSpace |
N = 25000
------
| min | avg | max | func | name |
|---------------|---------------|---------------|----------------------------|------------------|
| 0.009672165 | 0.010055504 | 0.011536598 | sarcoma | sarcoma |
| 0.019844294 | 0.022260010 | 0.027792931 | road_runner_with_sort | RoadRunner |
| 0.020462751 | 0.022415347 | 0.029330730 | patrick_artner_with_sort | Patrick_Artner |
| 0.024955750   | 0.027981100   | 0.031506777   | deep_space_with_sort       | DeepSpace        |
Sorted Benchmarks
N = 10
------
| min | avg | max | func | name |
|---------------|---------------|---------------|------------------|------------------|
| 0.000002861 | 0.000003138 | 0.000005960 | road_runner | RoadRunner |
| 0.000002861 | 0.000003231 | 0.000012398 | patrick_artner | Patrick_Artner |
| 0.000004292 | 0.000004461 | 0.000007629 | sarcoma | sarcoma |
| 0.000008821 | 0.000009136 | 0.000011921 | deep_space | DeepSpace |
N = 100
------
| min | avg | max | func | name |
|---------------|---------------|---------------|------------------|------------------|
| 0.000020027 | 0.000020833 | 0.000037909 | road_runner | RoadRunner |
| 0.000021458 | 0.000024126 | 0.000087738 | patrick_artner | Patrick_Artner |
| 0.000033140 | 0.000034373 | 0.000049591 | sarcoma | sarcoma |
| 0.000072241 | 0.000073054 | 0.000085592 | deep_space | DeepSpace |
N = 1000
------
| min | avg | max | func | name |
|---------------|---------------|---------------|------------------|------------------|
| 0.000200748 | 0.000207791 | 0.000290394 | patrick_artner | Patrick_Artner |
| 0.000207186 | 0.000219207 | 0.000277519 | road_runner | RoadRunner |
| 0.000333071 | 0.000369296 | 0.000570774 | sarcoma | sarcoma |
| 0.000635624 | 0.000721800 | 0.001362801 | deep_space | DeepSpace |
N = 10000
------
| min | avg | max | func | name |
|---------------|---------------|---------------|------------------|------------------|
| 0.002717972 | 0.002925014 | 0.003932238 | patrick_artner | Patrick_Artner |
| 0.002796888 | 0.003489044 | 0.004799843 | road_runner | RoadRunner |
| 0.004704714 | 0.005460148 | 0.008680582 | sarcoma | sarcoma |
| 0.005549192 | 0.006385834 | 0.009561062 | deep_space | DeepSpace |
N = 25000
------
| min | avg | max | func | name |
|---------------|---------------|---------------|------------------|------------------|
| 0.010142803 | 0.011239243 | 0.015279770 | patrick_artner | Patrick_Artner |
| 0.011211157 | 0.012368391 | 0.014696836 | road_runner | RoadRunner |
| 0.014389753 | 0.015374193 | 0.022623777 | sarcoma | sarcoma |
| 0.016021967 | 0.016560717 | 0.019297361 | deep_space | DeepSpace |
The benchmark script can be found at: https://github.com/sarcoma/python-script-benchmark-tools/blob/master/examples/filter_out_lowest_duplicates.py