Trim ordered dictionary based on key?

What is the fastest way to "trim" a dictionary based on its keys? My understanding is that dictionaries preserve insertion order as of Python 3.7.

I have a dictionary that contains key (type datetime): val (type float). The dictionary is in sorted (chronological) order.

time_series_dict = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    # .............
    "2019-03-01 12:15:00": 85,
}

I would like to trim the dictionary, removing everything outside of start_date and end_date. The dictionary can have thousands of entries. Is there a faster method than:

for k in list(time_series_dict.keys()):
    if not start_date <= k <= end_date:
        del time_series_dict[k]
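
The range test above compares keys directly. If the keys are strings in the fixed-width format shown in the sample (an assumption; the question describes them as datetime objects), string comparison should still agree with chronological order. A minimal sketch that checks this on hypothetical sample keys:

```python
from datetime import datetime

# Hypothetical sample keys in the same format as the question's example.
keys = [
    "2019-02-27 14:00:00",
    "2019-02-27 15:00:00",
    "2019-02-27 16:25:00",
    "2019-03-01 12:15:00",
]

# Parse each key so string order and chronological order can be compared.
parsed = [datetime.strptime(k, "%Y-%m-%d %H:%M:%S") for k in keys]

string_order = sorted(keys)
time_order = [k for _, k in sorted(zip(parsed, keys))]
same_order = string_order == time_order

# String bounds work directly in the range test from the question.
start_date, end_date = "2019-02-27 15:00:00", "2019-02-27 23:59:59"
in_range = [k for k in keys if start_date <= k <= end_date]
print(same_order, in_range)
```

This only holds for zero-padded, most-significant-field-first formats; other string formats would need parsing before comparison.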

If your dictionaries have thousands of keys, and you are removing keys from the start and end of the ordered sequence of timestamps, consider using binary search to find the cut-off points in a list copy of the keys. Python includes the bisect module for this:

from bisect import bisect_left, bisect_right

def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_left(ts, start_date)   # index of the first key >= start_date
    after = bisect_right(ts, end_date)     # index of the first key > end_date
    for i in range(before):                # delete keys < start_date
        del tsd[ts[i]]
    for i in range(after, len(ts)):        # delete keys > end_date
        del tsd[ts[i]]
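
To make the boundary behaviour concrete, here is a small self-contained sketch of the same idea on sample data. The helper name `trim_inclusive` and the sample values are hypothetical; it keeps both bounds, matching the `start_date <= k <= end_date` test in the question, and assumes the keys are already in ascending order.

```python
from bisect import bisect_left, bisect_right

def trim_inclusive(tsd, start_date, end_date):
    """Hypothetical helper: keep only keys with start_date <= key <= end_date.

    Assumes the dict's keys are already in ascending order.
    """
    ts = list(tsd)
    first_kept = bisect_left(ts, start_date)  # index of the first key >= start_date
    first_cut = bisect_right(ts, end_date)    # index of the first key > end_date
    for key in ts[:first_kept]:
        del tsd[key]
    for key in ts[first_cut:]:
        del tsd[key]

sample = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    "2019-03-01 12:15:00": 85,
}

# Bounds that do not coincide with any key.
trim_inclusive(sample, "2019-02-27 14:30:00", "2019-02-28 00:00:00")
print(sample)
```

Using bisect_left for the lower bound and bisect_right for the upper bound gives the same result whether or not the bounds are themselves keys in the dictionary.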

I've run some time trials to see whether this makes a difference on a dataset like yours; as expected, it pays off when the number of keys removed is significantly lower than the length of the input dictionary.

Time trial setup (imports, building the test data dictionary and start and end dates, defining the test functions):

>>> import random
>>> from bisect import bisect_left, bisect_right
>>> from datetime import datetime, timedelta
>>> from itertools import islice
>>> from timeit import Timer
>>> def randomised_ordered_timestamps():
...     date = datetime.now().replace(second=0, microsecond=0)
...     while True:
...         date += timedelta(minutes=random.randint(15, 360))
...         yield date.strftime('%Y-%m-%d %H:%M:%S')
...
>>> test_data = {ts: random.randint(50, 500) for ts in islice(randomised_ordered_timestamps(), 10000)}
>>> start_date = next(islice(test_data, 25, None))                 # trim 25 from the start
>>> end_date = next(islice(test_data, len(test_data) - 25, None))  # trim 25 from the end
>>> def iteration(t, start_date, end_date):
...     time_series_dict = t.copy()  # avoid mutating test data
...     for k in list(time_series_dict.keys()):
...         if not start_date <= k <= end_date:
...             del time_series_dict[k]
...
>>> def bisection(t, start_date, end_date):
...     tsd = t.copy()  # avoid mutating test data
...     ts = list(tsd)
...     before = bisect_left(ts, start_date)   # index of the first key >= start_date
...     after = bisect_right(ts, end_date)     # index of the first key > end_date
...     for i in range(before):                # delete keys < start_date
...         del tsd[ts[i]]
...     for i in range(after, len(ts)):        # delete keys > end_date
...         del tsd[ts[i]]
...

Trial outcome:

>>> count, total = Timer("t.copy()", "from __main__ import test_data as t").autorange()
>>> baseline = total / count
>>> for test in (iteration, bisection):
...     timer = Timer("test(t, s, e)", "from __main__ import test, test_data as t, start_date as s, end_date as e")
...     count, total = timer.autorange()
...     print(f"{test.__name__:>10}: {((total / count) - baseline) * 1000000:6.2f} microseconds")
...
 iteration: 671.33 microseconds
 bisection:  80.92 microseconds

(The test first subtracts the baseline cost of making a dict copy.)

However, there may well be more efficient data structures for this kind of operation. I checked out the sortedcontainers project, as it includes a SortedDict() type that supports bisection on the keys directly. Unfortunately, while it performs better than your iteration approach, I can't make it perform better here than bisecting on a copy of the keys list:

>>> from sortedcontainers import SortedDict
>>> test_data_sorteddict = SortedDict(test_data)
>>> def sorteddict(t, start_date, end_date):
...     tsd = t.copy()
...     # SortedDict supports slicing on the key view
...     keys = tsd.keys()
...     del keys[:tsd.bisect_left(start_date)]   # drop keys < start_date
...     del keys[tsd.bisect_right(end_date):]    # drop keys > end_date
...
>>> count, total = Timer("t.copy()", "from __main__ import test_data_sorteddict as t").autorange()
>>> baseline = total / count
>>> timer = Timer("test(t, s, e)", "from __main__ import sorteddict as test, test_data_sorteddict as t, start_date as s, end_date as e")
>>> count, total = timer.autorange()
>>> print(f"sorteddict: {((total / count) - baseline) * 1000000:6.2f} microseconds")
sorteddict: 249.46 microseconds

I may be using the project incorrectly, however. Deleting keys from SortedDict objects is O(NlogN), so I suspect that's where this falls down. Creating a new SortedDict() object from the other 9950 key-value pairs is slower still (over 2 milliseconds, not something you want to compare against the other approaches).

However, if you were to use the SortedDict.irange() method, you could simply ignore the out-of-range keys rather than delete them, and iterate over a subset of the dictionary keys:

# timeseries is assumed to be a SortedDict keyed by timestamp
for ts in timeseries.irange(start_date, end_date, inclusive=(False, False)):
    # iterates over all start_date < timestamp < end_date keys, in order.
    ...

eliminating the need to delete anything. The irange() implementation uses bisection under the hood.
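
The same "iterate over a range instead of deleting" idea can be sketched with only the standard library, for cases where adding a dependency is not an option. This is not the sortedcontainers implementation; `iter_range` is a hypothetical helper that bisects a separately maintained sorted list of keys, and it uses exclusive bounds to mirror the inclusive=(False, False) call above.

```python
from bisect import bisect_left, bisect_right

def iter_range(mapping, sorted_keys, start_date, end_date):
    """Hypothetical helper: yield (key, value) with start_date < key < end_date.

    sorted_keys must hold the mapping's keys in ascending order.
    """
    lo = bisect_right(sorted_keys, start_date)  # first index with key > start_date
    hi = bisect_left(sorted_keys, end_date)     # first index with key >= end_date
    for i in range(lo, hi):
        key = sorted_keys[i]
        yield key, mapping[key]

data = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    "2019-03-01 12:15:00": 85,
}
keys = list(data)  # already in ascending order for this sample

selected = list(iter_range(data, keys, "2019-02-27 14:00:00", "2019-03-01 12:15:00"))
print(selected)
```

The dictionary itself is never modified, so the cost is two binary searches plus the iteration over the keys actually wanted.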

import timeit

print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t =datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i ==100:
        start_time = t
    if i == 900:
        end_time = t
        """,
stmt="""
tmp = time_series_dict.copy()
for k in list(tmp.keys()):
    if not start_time <= k <= end_time:
        del tmp[k]

""",
number=10000
))
print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t =datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i ==100:
        start_time = t
    if i == 900:
        end_time = t
""",
stmt="""
tmp = time_series_dict.copy()
result = {}
for k in list(tmp.keys()):
    if start_time <= k <= end_time:
        result[k] = tmp[k]
""",
number=10000
))
print(timeit.timeit(setup="""
import datetime
from bisect import bisect_left, bisect_right

time_series_dict = {}
for i in range(10000):
    t =datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i ==100:
        start_time = t
    if i == 900:
        end_time = t

""",
stmt="""
tmp = time_series_dict.copy()
def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_left(ts, start_date)   # index of the first key >= start_date
    after = bisect_right(ts, end_date)     # index of the first key > end_date
    for i in range(before):                # delete keys < start_date
        del tsd[ts[i]]
    for i in range(after, len(ts)):        # delete keys > end_date
        del tsd[ts[i]]

trim_time_series_dict(tmp, start_time, end_time)
""",
number=10000
))

Test results (total seconds for 10000 runs of each snippet, in the order above):

12.558672609
9.662761111
7.990544049
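
The second snippet timed above builds a new dictionary from the in-range keys instead of deleting the out-of-range ones. The same idea can be written as a dict comprehension; this is only a sketch with a hypothetical helper name, `trimmed_copy`, and it has not been benchmarked against the figures above.

```python
def trimmed_copy(tsd, start_date, end_date):
    """Hypothetical helper: return a new dict holding only the in-range items."""
    return {k: v for k, v in tsd.items() if start_date <= k <= end_date}

original = {
    "2019-02-27 14:00:00": 95,
    "2019-02-27 15:00:00": 98,
    "2019-02-27 16:25:00": 80,
    "2019-03-01 12:15:00": 85,
}

kept = trimmed_copy(original, "2019-02-27 15:00:00", "2019-02-27 16:25:00")
print(kept)
```

Building a new dictionary tends to suit cases where only a small slice is kept, since the work is proportional to the scan plus the number of kept items, and the original data stays available.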
