
Fastest way to uniquify a list of dicts with a rule?

I have a list of dicts:

list1 = [
  { 'T': 1234, 'V': 10, 'O': 1 },
  { 'T': 2345, 'V': 50, 'O': 5 },
  { 'T': 2345, 'V': 30, 'O': 3 },
  { 'T': 3456, 'V': 40, 'O': 91 },
]

I need to uniquify these:

  • T should be unique
  • Whichever dict has the larger V should take precedence

Which should produce:

[
  {'T': 1234, 'V': 10, 'O': 1}, 
  {'T': 2345, 'V': 50, 'O': 5}, 
  {'T': 3456, 'V': 40, 'O': 91}
]

I came up with this:

interm = {o['T']: o for o in list1}
for o in list1:
  if o['V'] > interm[o['T']]['V']:
    interm[o['T']] = o

However, I effectively iterate the list twice and set dict values multiple times. It feels like this could be improved, but I'm not sure how to go about it.

Is there a faster way to achieve this within the given constraints?

Assuming list1 is already sorted by T, you can use itertools.groupby:

from itertools import groupby

li = [
  { 'T': 1234, 'V': 10, 'O': 1 },
  { 'T': 2345, 'V': 50, 'O': 5 },
  { 'T': 2345, 'V': 30, 'O': 3 },
  { 'T': 3456, 'V': 40, 'O': 91 },
]

output = [max(group, key=lambda d: d['V'])
          for _, group in groupby(li, key=lambda d: d['T'])]

print(output)
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]

If not, groupby can still be used together with sort for an O(n log n) solution:

order_by_t = lambda d: d['T']

li.sort(key=order_by_t)

output = [max(group, key=lambda d: d['V'])
          for _, group in groupby(li, key=order_by_t)]
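As a minor variation on the answer above, the two lambdas can also be swapped for `operator.itemgetter`, which is often marginally faster and arguably more readable. A minimal sketch of the same sort-then-groupby approach (the `by_t`/`by_v` names are my own, not from the original answer):

```python
from itertools import groupby
from operator import itemgetter

li = [
    {'T': 2345, 'V': 50, 'O': 5},
    {'T': 1234, 'V': 10, 'O': 1},
    {'T': 2345, 'V': 30, 'O': 3},
    {'T': 3456, 'V': 40, 'O': 91},
]

by_t, by_v = itemgetter('T'), itemgetter('V')

# groupby only merges adjacent items, so sort by T first
li.sort(key=by_t)
output = [max(group, key=by_v) for _, group in groupby(li, key=by_t)]

print(output)
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]
```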

Here is a step-by-step approach. It iterates your list once and builds a new list:

list1 = [
  { 'T': 1234, 'V': 10, 'O': 1 },
  { 'T': 2345, 'V': 50, 'O': 5 },
  { 'T': 2345, 'V': 30, 'O': 3 },
  { 'T': 3456, 'V': 40, 'O': 91 },
]

# add this step if not already sorted by T
# list1 = sorted(list1, key = lambda x: x["T"]) 

list2 = []
for e in list1:
    t, v, o = e["T"], e["V"], e["O"]

    # we already stored something and same T
    if list2 and list2[-1]["T"] == t:

        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o

    # did not store anything or other T
    else:
        list2.append(e)

print(list2)

Output:

[{'T': 1234, 'O': 1, 'V': 10}, 
 {'T': 2345, 'O': 5, 'V': 50}, 
 {'T': 3456, 'O': 91, 'V': 40}]

Assuming your list is already sorted by T, you can simply keep track of the element with the largest V seen so far in a single pass, replacing the stored maximum whenever a larger one is found:

list1 = [
    { 'T': 1234, 'V': 10, 'O': 1 },
    { 'T': 2345, 'V': 50, 'O': 5 },
    { 'T': 2345, 'V': 30, 'O': 3 },
    { 'T': 3456, 'V': 40, 'O': 91 },
] 

unique = {}
for dic in list1:
    key = dic['T']
    found = unique.get(key)

    # If value found and doesn't exceed current maximum, just ignore
    if found and dic['V'] <= found['V']:
        continue

    # otherwise just update normally
    unique[key] = dic

print(list(unique.values()))
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]
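The same single-pass idea can be condensed by folding the two checks into one condition (my own sketch, behaviorally equivalent to the loop above; since Python 3.7 dicts preserve insertion order, so the result order also matches):

```python
list1 = [
    {'T': 1234, 'V': 10, 'O': 1},
    {'T': 2345, 'V': 50, 'O': 5},
    {'T': 2345, 'V': 30, 'O': 3},
    {'T': 3456, 'V': 40, 'O': 91},
]

best = {}
for d in list1:
    t = d['T']
    # keep the first dict for each T, or replace it when a strictly larger V appears
    if t not in best or d['V'] > best[t]['V']:
        best[t] = d

print(list(best.values()))
# [{'T': 1234, 'V': 10, 'O': 1}, {'T': 2345, 'V': 50, 'O': 5}, {'T': 3456, 'V': 40, 'O': 91}]
```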

If your list is not guaranteed to be sorted by T, you can sort it beforehand using T as the sort key:

from operator import itemgetter

list1 = sorted(list1, key=itemgetter('T'))

Using operator.itemgetter above is equivalent to:

list1 = sorted(list1, key=lambda x: x['T'])

Since the question asks for the "fastest" way, I timed the current approaches with the given data. It seems RoadRunner's runs fastest on this dataset, mine comes second, and DeepSpace's solution is third.

>>> import timeit
>>> timeit.timeit(p1,setup=up)        # https://stackoverflow.com/a/54957067/7505395
2.5858893489556913
>>> timeit.timeit(p2,setup=up)        # https://stackoverflow.com/a/54957090/7505395
0.8051884429499854
>>> timeit.timeit(p3,setup=up)        # https://stackoverflow.com/a/54957156/7505395
0.7680418536661247

Test code:

up = """from itertools import groupby

li = [
{ 'T': 1234, 'V': 10, 'O': 1 },
{ 'T': 2345, 'V': 50, 'O': 5 },
{ 'T': 2345, 'V': 30, 'O': 3 },
{ 'T': 3456, 'V': 40, 'O': 91 },
]"""

Source: https://stackoverflow.com/a/54957067/7505395

p1 = """
# li.sort(key=lambda x:x["T"]) # for the random data
output = [max(group, key=lambda d: d['V'])
        for _, group in groupby(li, key=lambda d: d['T'])]
"""

Source: https://stackoverflow.com/a/54957090/7505395

p2 = """
# li.sort(key=lambda x:x["T"]) # for the random data
list2 = []
for e in li:
    t, v, o = e["T"], e["V"], e["O"]

    # we already stored something and same T
    if list2 and list2[-1]["T"] == t:

        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o

    # did not store anything or other T
    else:
        list2.append(e)
"""

Source: https://stackoverflow.com/a/54957156/7505395

p3 = """
unique = {}
for dic in li:
    key = dic['T']
    found = unique.get(key)

    # If value found and doesn't exceed current maximum, just ignore
    if found and dic['V'] <= found['V']:
        continue

    # otherwise just update normally
    unique[key] = dic 
"""

Edit (random 10k data, sorted and unsorted) to see whether the results are data-dependent:

Random data: 10000 data points, T in [1,100], V in [10,20,...,200], "O" in [1,1000000]

up = """
from itertools import groupby
import random

random.seed(42)

def r():
    # few T so we get plenty of dupes
    return {"T":random.randint(1,100), "V":random.randint(1,20)*10, 
            "O":random.randint(1,1000000)}
li = [ r() for _ in range(10000)]

# li.sort(key=lambda x:x["T"])  # uncommented for pre-sorted run

"""

Source: https://stackoverflow.com/a/54957067/7505395

p1 = """
li.sort(key=lambda x:x["T"])  # needs sorting, commented for pre-sorted run
output = [max(group, key=lambda d: d['V'])
        for _, group in groupby(li, key=lambda d: d['T'])]
"""

Source: https://stackoverflow.com/a/54957090/7505395

p2 = """ 
li.sort(key=lambda x:x["T"])  # needs sorting, commented for pre-sorted run
list2 = []
for e in li:
    t, v, o = e["T"], e["V"], e["O"]

    # we already stored something and same T
    if list2 and list2[-1]["T"] == t:

        # smaller V ?
        if list2[-1]["V"] < v:
            # overwrite dict elements
            list2[-1]["V"] = v
            list2[-1]["O"] = o

    # did not store anything or other T
    else:
        list2.append(e)
"""

Source: https://stackoverflow.com/a/54957156/7505395

p3 = """
unique = {}
for dic in li:
    key = dic['T']
    found = unique.get(key)

    # If value found and doesn't exceed current maximum, just ignore
    if found and dic['V'] <= found['V']:
        continue

    # otherwise just update normally
    unique[key] = dic 
"""

Source: https://stackoverflow.com/a/54957363/7505395

p4 = """ 
t_v = {}
result = []
for row in li:
    if not t_v.get(row['T']):
        t_v[row['T']] = (row['V'], len(result))
        result.append(row)
        continue

    if row['V'] > t_v[row['T']][0]:
        t_v[row['T']] = (row['V'], t_v[row['T']][1])
        result[t_v[row['T']][1]] = row
"""

Results with the sort done inside p1/p2:

import timeit
timeit.timeit(p1,setup=up, number=100)       0.4958197257468498      4th
timeit.timeit(p2,setup=up, number=100)       0.4506078658396253      3rd
timeit.timeit(p3,setup=up, number=100)       0.24399979946368378     1st
timeit.timeit(p4,setup=up, number=100)       0.2561938286132954      2nd

Results on pre-sorted data:

timeit.timeit(p1,setup=up, number=100)       0.3046940103986765      3rd
timeit.timeit(p2,setup=up, number=100)       0.33943337437485366     4th
timeit.timeit(p3,setup=up, number=100)       0.2795306502784811      1st
timeit.timeit(p4,setup=up, number=100)       0.29027710723995326     2nd
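For reference, the comparison above can be reproduced with a small self-contained harness along these lines (my own sketch, not the original scripts; it uses a smaller list and only two of the approaches, so absolute times will differ from the numbers shown):

```python
import random
import timeit
from itertools import groupby
from operator import itemgetter

random.seed(42)

def make_data(n):
    # few distinct T values so we get plenty of duplicates
    return [{"T": random.randint(1, 100),
             "V": random.randint(1, 20) * 10,
             "O": random.randint(1, 1000000)} for _ in range(n)]

def via_groupby(li):
    # sort-then-groupby, as in the first answer
    li = sorted(li, key=itemgetter('T'))
    return [max(g, key=itemgetter('V'))
            for _, g in groupby(li, key=itemgetter('T'))]

def via_dict(li):
    # one-pass dict of per-T maxima, as in DeepSpace's answer
    unique = {}
    for d in li:
        t = d['T']
        if t not in unique or d['V'] > unique[t]['V']:
            unique[t] = d
    return list(unique.values())

data = make_data(1000)
for name, fn in (("groupby+sort", via_groupby), ("dict one-pass", via_dict)):
    print(name, timeit.timeit(lambda: fn(data), number=100))
```

Both functions keep the earliest dict among ties with equal V, so they produce the same set of results (the groupby variant returns them ordered by T, the dict variant in first-seen order).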

For this, in a single loop over the unsorted list, I create a lookup table that stores information about the current result list. The lookup table stores 'T' as a key, together with the 'V' value and the item's index in the result list.

While looping over the data, you check each row's 'T' value against the lookup table keys.

If the key does not exist, add it.

If it does, compare its stored value with the row's 'V' value.

If the current row's 'V' is larger, you can use the stored index to replace the previous row.

arr = [
    {'T': 2345, 'V': 50, 'O': 5},
    {'T': 1234, 'V': 10, 'O': 1},
    {'T': 2345, 'V': 30, 'O': 3},
    {'T': 3456, 'V': 40, 'O': 91},
]


def filter_out_lowest_values(arr):
    lookup = {}
    result = []
    for row in arr:
        row_key, row_value = row['T'], row['V']
        if row_key not in lookup:
            lookup[row_key] = (row_value, len(result))
            result.append(row)
            continue

        lookup_value, result_index = lookup[row_key]
        if row_value > lookup_value:
            lookup[row_key] = (row_value, result_index)
            result[result_index] = row

    return result


print(filter_out_lowest_values(arr))

Result:

> [{'T': 2345, 'V': 50, 'O': 5}, {'T': 1234, 'V': 10, 'O': 1}, {'T': 3456, 'V': 40, 'O': 91}]

To answer the question of the fastest way to uniquify the list, see the benchmarks below.

It depends heavily on the data provided. The length of the list, whether it is sorted, and the number of unique keys all play a part.

From my benchmarks I found Patrick Artner's to be the fastest on sorted lists. My own is fastest on unsorted lists once the lookup table becomes well populated.

Benchmark comparison

Each script was run 100 times for each value of n, and the fastest (minimum) runtime is plotted.

Unsorted data benchmarks

Unsorted Benchmarks
N = 10
------
|  min          |  avg          |  max          |  func                      |  name            |
|---------------|---------------|---------------|----------------------------|------------------|
|  0.000006437  |  0.000007293  |  0.000022173  |  sarcoma                   |  sarcoma         |
|  0.000007153  |  0.000007646  |  0.000017881  |  road_runner_with_sort     |  RoadRunner      |
|  0.000007868  |  0.000008337  |  0.000013351  |  patrick_artner_with_sort  |  Patrick_Artner  |
|  0.000015497  |  0.000017719  |  0.000026703  |  deep_space_with_sort      |  DeepSpace       |

N = 100
------
|  min          |  avg          |  max          |  func                      |  name            |
|---------------|---------------|---------------|----------------------------|------------------|
|  0.000043154  |  0.000045519  |  0.000057936  |  road_runner_with_sort     |  RoadRunner      |
|  0.000053883  |  0.000056396  |  0.000069141  |  sarcoma                   |  sarcoma         |
|  0.000055075  |  0.000057223  |  0.000063181  |  patrick_artner_with_sort  |  Patrick_Artner  |
|  0.000135660  |  0.000145028  |  0.000174046  |  deep_space_with_sort      |  DeepSpace       |

N = 1000
------
|  min          |  avg          |  max          |  func                      |  name            |
|---------------|---------------|---------------|----------------------------|------------------|
|  0.000294447  |  0.000559096  |  0.000992775  |  road_runner_with_sort     |  RoadRunner      |
|  0.000327826  |  0.000374844  |  0.000650883  |  patrick_artner_with_sort  |  Patrick_Artner  |
|  0.000344276  |  0.000605364  |  0.002207994  |  sarcoma                   |  sarcoma         |
|  0.000758171  |  0.001031160  |  0.002290487  |  deep_space_with_sort      |  DeepSpace       |

N = 10000
------
|  min          |  avg          |  max          |  func                      |  name            |
|---------------|---------------|---------------|----------------------------|------------------|
|  0.003607988  |  0.003875387  |  0.005285978  |  road_runner_with_sort     |  RoadRunner      |
|  0.003780127  |  0.004181504  |  0.005370378  |  sarcoma                   |  sarcoma         |
|  0.003986597  |  0.004258037  |  0.006756544  |  patrick_artner_with_sort  |  Patrick_Artner  |
|  0.007097244  |  0.007444410  |  0.009983778  |  deep_space_with_sort      |  DeepSpace       |

N = 25000
------
|  min          |  avg          |  max          |  func                      |  name            |
|---------------|---------------|---------------|----------------------------|------------------|
|  0.009672165  |  0.010055504  |  0.011536598  |  sarcoma                   |  sarcoma         |
|  0.019844294  |  0.022260010  |  0.027792931  |  road_runner_with_sort     |  RoadRunner      |
|  0.020462751  |  0.022415347  |  0.029330730  |  patrick_artner_with_sort  |  Patrick_Artner  |
|  0.024955750  |  0.027981100  |  0.031506777  |  deep_space_with_sort      |  DeepSpace       |

Sorted data benchmarks

Sorted Benchmarks
N = 10
------
|  min          |  avg          |  max          |  func            |  name            |
|---------------|---------------|---------------|------------------|------------------|
|  0.000002861  |  0.000003138  |  0.000005960  |  road_runner     |  RoadRunner      |
|  0.000002861  |  0.000003231  |  0.000012398  |  patrick_artner  |  Patrick_Artner  |
|  0.000004292  |  0.000004461  |  0.000007629  |  sarcoma         |  sarcoma         |
|  0.000008821  |  0.000009136  |  0.000011921  |  deep_space      |  DeepSpace       |

N = 100
------
|  min          |  avg          |  max          |  func            |  name            |
|---------------|---------------|---------------|------------------|------------------|
|  0.000020027  |  0.000020833  |  0.000037909  |  road_runner     |  RoadRunner      |
|  0.000021458  |  0.000024126  |  0.000087738  |  patrick_artner  |  Patrick_Artner  |
|  0.000033140  |  0.000034373  |  0.000049591  |  sarcoma         |  sarcoma         |
|  0.000072241  |  0.000073054  |  0.000085592  |  deep_space      |  DeepSpace       |

N = 1000
------
|  min          |  avg          |  max          |  func            |  name            |
|---------------|---------------|---------------|------------------|------------------|
|  0.000200748  |  0.000207791  |  0.000290394  |  patrick_artner  |  Patrick_Artner  |
|  0.000207186  |  0.000219207  |  0.000277519  |  road_runner     |  RoadRunner      |
|  0.000333071  |  0.000369296  |  0.000570774  |  sarcoma         |  sarcoma         |
|  0.000635624  |  0.000721800  |  0.001362801  |  deep_space      |  DeepSpace       |

N = 10000
------
|  min          |  avg          |  max          |  func            |  name            |
|---------------|---------------|---------------|------------------|------------------|
|  0.002717972  |  0.002925014  |  0.003932238  |  patrick_artner  |  Patrick_Artner  |
|  0.002796888  |  0.003489044  |  0.004799843  |  road_runner     |  RoadRunner      |
|  0.004704714  |  0.005460148  |  0.008680582  |  sarcoma         |  sarcoma         |
|  0.005549192  |  0.006385834  |  0.009561062  |  deep_space      |  DeepSpace       |

N = 25000
------
|  min          |  avg          |  max          |  func            |  name            |
|---------------|---------------|---------------|------------------|------------------|
|  0.010142803  |  0.011239243  |  0.015279770  |  patrick_artner  |  Patrick_Artner  |
|  0.011211157  |  0.012368391  |  0.014696836  |  road_runner     |  RoadRunner      |
|  0.014389753  |  0.015374193  |  0.022623777  |  sarcoma         |  sarcoma         |
|  0.016021967  |  0.016560717  |  0.019297361  |  deep_space      |  DeepSpace       |


The benchmark script can be found at: https://github.com/sarcoma/python-script-benchmark-tools/blob/master/examples/filter_out_lowest_duplicates.py
