Trim ordered dictionary based on key?
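What is the fastest way to "trim" a dictionary based on its keys? My understanding is that dictionaries preserve insertion order since Python 3.7.
I have a dictionary of key (datetime type): val (float type) pairs. The dictionary is sorted in chronological order.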
time_series_dict = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    .............
    "2019-03-01 12:15:00": 85
}
I'd like to trim the dictionary, removing everything outside start_date and end_date. The dictionary can have 1000s of values. Is there a faster method than:
for k in list(time_series_dict.keys()):
    if not start_date <= k <= end_date:
        del time_series_dict[k]
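The `start_date <= k <= end_date` test works on string keys like the ones shown because zero-padded `YYYY-MM-DD HH:MM:SS` strings sort lexicographically in chronological order. A quick check of that assumption, using made-up sample keys:

```python
from datetime import datetime

# Sample keys for illustration only, deliberately out of order.
keys = [
    "2019-02-27 14:00:00",
    "2019-02-27 16:25:00",
    "2019-03-01 12:15:00",
    "2019-02-27 15:00:00",
]

FORMAT = "%Y-%m-%d %H:%M:%S"

# Sort once as plain strings and once as parsed datetime values.
by_string = sorted(keys)
by_datetime = sorted(keys, key=lambda k: datetime.strptime(k, FORMAT))

# Both orderings agree for this fixed-width, zero-padded format.
same_order = by_string == by_datetime
print(same_order)
```

If the keys were real `datetime` objects instead of strings, the same comparisons would work as long as `start_date` and `end_date` are `datetime` objects too.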
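If your dictionaries have 1000s of keys, and you are removing keys from the start and end of the ordered sequence of timestamps, consider using binary search to find the cut-off points in a list copy of the keys. Python includes the bisect module for this: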
from bisect import bisect_left, bisect_right

def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)  # the keys, already in ascending order
    before = bisect_left(ts, start_date)  # index of the first key >= start_date
    after = bisect_right(ts, end_date)    # index of the first key > end_date
    for i in range(before):               # keys < start_date
        del tsd[ts[i]]
    for i in range(after, len(ts)):       # keys > end_date
        del tsd[ts[i]]
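A small usage sketch of a bisect-based trim on made-up sample data. The function is written out again so the block runs on its own; `trim_inclusive` is a hypothetical name, and it assumes the keys are already in ascending order and that both bounds are inclusive, matching the `start_date <= k <= end_date` test in the question.

```python
from bisect import bisect_left, bisect_right


def trim_inclusive(tsd, start_date, end_date):
    """Delete, in place, every key outside start_date <= key <= end_date.

    Assumes the dict's keys are already in ascending order.
    """
    ts = list(tsd)
    first_kept = bisect_left(ts, start_date)  # index of first key >= start_date
    first_cut = bisect_right(ts, end_date)    # index of first key > end_date
    for key in ts[:first_kept]:
        del tsd[key]
    for key in ts[first_cut:]:
        del tsd[key]


# Sample data for illustration only.
sample = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    "2019-02-28 09:30:00": 91,
    "2019-03-01 12:15:00": 85,
}

# The end bound does not have to be an existing key.
trim_inclusive(sample, "2019-02-27 15:00:00", "2019-02-28 12:00:00")
print(list(sample))
```

The start bound is an existing key and is kept; the end bound falls between two keys, so everything after it is removed.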
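I've run some time trials to see whether this makes a difference with your typical datasets. As expected, it pays off when the number of keys deleted is significantly lower than the length of the input dictionary.
Time trial setup (imports, building the test data dictionary and the start and end dates, defining the test functions):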
>>> import random
>>> from bisect import bisect_left, bisect_right
>>> from datetime import datetime, timedelta
>>> from itertools import islice
>>> from timeit import Timer
>>> def randomised_ordered_timestamps():
...     date = datetime.now().replace(second=0, microsecond=0)
...     while True:
...         date += timedelta(minutes=random.randint(15, 360))
...         yield date.strftime('%Y-%m-%d %H:%M:%S')
...
>>> test_data = {ts: random.randint(50, 500) for ts in islice(randomised_ordered_timestamps(), 10000)}
>>> start_date = next(islice(test_data, 25, None)) # trim 25 from the start
>>> end_date = next(islice(test_data, len(test_data) - 25, None)) # trim 25 from the end
>>> def iteration(t, start_date, end_date):
...     time_series_dict = t.copy()  # avoid mutating test data
...     for k in list(time_series_dict.keys()):
...         if not start_date <= k <= end_date:
...             del time_series_dict[k]
...
>>> def bisection(t, start_date, end_date):
...     tsd = t.copy()  # avoid mutating test data
...     ts = list(tsd)
...     before = bisect_left(ts, start_date)  # index of the first key >= start_date
...     after = bisect_right(ts, end_date)    # index of the first key > end_date
...     for i in range(before):               # keys < start_date
...         del tsd[ts[i]]
...     for i in range(after, len(ts)):       # keys > end_date
...         del tsd[ts[i]]
...
The results of the trials:
>>> count, total = Timer("t.copy()", "from __main__ import test_data as t").autorange()
>>> baseline = total / count
>>> for test in (iteration, bisection):
...     timer = Timer("test(t, s, e)", "from __main__ import test, test_data as t, start_date as s, end_date as e")
...     count, total = timer.autorange()
...     print(f"{test.__name__:>10}: {((total / count) - baseline) * 1000000:6.2f} microseconds")
...
 iteration: 671.33 microseconds
 bisection:  80.92 microseconds
(The test first subtracts the baseline cost of making a dict copy.)
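However, there may well be more efficient data structures for this kind of operation. I checked out the sortedcontainers project, as it includes a SortedDict() type that supports bisection on the keys directly. Unfortunately, while it performs better than your iteration approach, I can't make it perform better here than bisecting on a copy of the keys list: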
>>> from sortedcontainers import SortedDict
>>> test_data_sorteddict = SortedDict(test_data)
>>> def sorteddict(t, start_date, end_date):
...     tsd = t.copy()
...     # SortedDict supports slicing on the key view
...     keys = tsd.keys()
...     del keys[:tsd.bisect_left(start_date)]  # keys < start_date
...     del keys[tsd.bisect_right(end_date):]   # keys > end_date
...
>>> count, total = Timer("t.copy()", "from __main__ import test_data_sorteddict as t").autorange()
>>> baseline = total / count
>>> timer = Timer("test(t, s, e)", "from __main__ import sorteddict as test, test_data_sorteddict as t, start_date as s, end_date as e")
>>> count, total = timer.autorange()
>>> print(f"sorteddict: {((total / count) - baseline) * 1000000:6.2f} microseconds")
sorteddict: 249.46 microseconds
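I may be using the project wrong, however. Deleting keys from SortedDict objects is O(NlogN), so I suspect that is where this falls down. Creating a new SortedDict() object from the other 9950 key-value pairs is slower still (over 2 milliseconds, not something you want to compare against the other approaches).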
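However, if you were to use the SortedDict.irange() method, you can simply ignore the values rather than delete them, and iterate over a subset of the dictionary keys: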
for ts in timeseries.irange(start_date, end_date, inclusive=(False, False)):
    # iterates over all start_date < timestamp < end_date keys, in order.
    ...
This removes the need to delete anything. The irange() implementation uses bisection under the hood.
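The same "select instead of delete" idea can be sketched with the standard library only. This is not how irange() is implemented; `keys_in_range` is a hypothetical helper that assumes you keep a sorted list of the keys alongside the dict, and it uses inclusive bounds on both ends.

```python
from bisect import bisect_left, bisect_right


def keys_in_range(sorted_keys, start_date, end_date):
    """Yield keys with start_date <= key <= end_date from a sorted list."""
    lo = bisect_left(sorted_keys, start_date)   # first index with key >= start_date
    hi = bisect_right(sorted_keys, end_date)    # first index with key > end_date
    for i in range(lo, hi):
        yield sorted_keys[i]


# Sample data for illustration only.
data = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    "2019-03-01 12:15:00": 85,
}
keys = list(data)  # relies on the dict already being in ascending key order

# Nothing is deleted; the dict is only read for the selected keys.
selected = [
    (k, data[k])
    for k in keys_in_range(keys, "2019-02-27 15:00:00", "2019-02-27 23:59:59")
]
print(selected)
```

The key list has to be rebuilt or updated whenever the dict changes, which is the bookkeeping that a sorted container does for you.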
import timeit

print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i == 100:
        start_time = t
    if i == 900:
        end_time = t
""",
stmt="""
tmp = time_series_dict.copy()
for k in list(tmp.keys()):
    if not start_time <= k <= end_time:
        del tmp[k]
""",
number=10000
))
print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i == 100:
        start_time = t
    if i == 900:
        end_time = t
""",
stmt="""
tmp = time_series_dict.copy()
result = {}
for k in list(tmp.keys()):
    if start_time <= k <= end_time:
        result[k] = tmp[k]
""",
number=10000
))
print(timeit.timeit(setup="""
import datetime
from bisect import bisect_left, bisect_right
time_series_dict = {}
for i in range(10000):
    t = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i == 100:
        start_time = t
    if i == 900:
        end_time = t
""",
stmt="""
tmp = time_series_dict.copy()
def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_left(ts, start_date)  # index of the first key >= start_date
    after = bisect_right(ts, end_date)    # index of the first key > end_date
    for i in range(before):               # keys < start_date
        del tsd[ts[i]]
    for i in range(after, len(ts)):       # keys > end_date
        del tsd[ts[i]]
trim_time_series_dict(tmp, start_time, end_time)
""",
number=10000
))
Test results:
12.558672609
9.662761111
7.990544049
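The rebuild and bisect variants can also be combined: locate the cut-off points with bisect, then build a new dict from only the slice in between, leaving the input untouched. A minimal sketch that has not been benchmarked here; `trimmed_copy` is a hypothetical name, the keys are assumed to be in ascending order, and both bounds are inclusive.

```python
from bisect import bisect_left, bisect_right
from itertools import islice


def trimmed_copy(tsd, start_date, end_date):
    """Return a new dict holding only start_date <= key <= end_date."""
    ts = list(tsd)
    lo = bisect_left(ts, start_date)   # first index with key >= start_date
    hi = bisect_right(ts, end_date)    # first index with key > end_date
    # islice walks the items in insertion order and keeps positions lo..hi-1.
    return dict(islice(tsd.items(), lo, hi))


# Sample data for illustration only.
series = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    "2019-03-01 12:15:00": 85,
}
kept = trimmed_copy(series, "2019-02-27 15:00:00", "2019-02-27 16:25:00")
print(kept)
```

Whether this beats deleting in place may depend on how much of the dict is kept, so it is worth timing against your own data.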