简体   繁体   中英

Trim ordered dictionary based on key?

What is the fastest way to "trim" a dictionary based on they key? My understanding is that dictionaries now preserve order since Python 3.7

I have a dictionary that contains key (type datetime): val (type float). The dictionary is in a sorted (chronological) order.

time_series_dict = 
{"2019-02-27 14:00:00": 95,
"2019-02-27 15:00:00": 98,
"2019-02-27 16:25:00: 80,
.............
"2019-03-01 12:15:00": 85
}

I would like to trim the dictionary, removing everything outside of start_date and end_date . Dictionary can have 1000s of values. Is there a faster method than:

for k in list(time_series_dict.keys()):
    if not start_date <= k <= end_date:
        del time_series_dict[k]

If your dictionaries have 1000s of keys, and you are removing keys from the start and end of the ordered sequence of timestamps, consider using binary search to find the cut-off points in a list copy of the keys. Python includes the bisect module for this:

from bisect import bisect_left, bisect_right

def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_right(ts, start_date)  # insertion point at > start_date
    after = bisect_left(ts, end_date)      # insertion point is < end_date
    for i in range(before):                # up to == start_date
        del tsd[ts[i]]
    for i in range(after + 1, len(ts)):    # from >= end_date onwards
        del tsd[ts[i]]

I've run some time trials to see if this is going to make a difference against your typical datasets; as expected, it pays off when the number of keys removed is significantly lower than the length of the input dictionary.

Time trial setup (imports, building the test data dictionary and start and end dates, defining the test functions)

>>> import random
>>> from bisect import bisect_left, bisect_right
>>> from datetime import datetime, timedelta
>>> from itertools import islice
>>> from timeit import Timer
>>> def randomised_ordered_timestamps():
...     date = datetime.now().replace(second=0, microsecond=0)
...     while True:
...         date += timedelta(minutes=random.randint(15, 360))
...         yield date.strftime('%Y-%m-%d %H:%M:%S')
...
>>> test_data = {ts: random.randint(50, 500) for ts in islice(randomised_ordered_timestamps(), 10000)}
>>> start_date = next(islice(test_data, 25, None))                 # trim 25 from the start
>>> end_date = next(islice(test_data, len(test_data) - 25, None))  # trim 25 from the end
>>> def iteration(t, start_date, end_date):
...     time_series_dict = t.copy()  # avoid mutating test data
...     for k in list(time_series_dict.keys()):
...         if not start_date <= k <= end_date:
...             del time_series_dict[k]
...
>>> def bisection(t, start_date, end_date):
...     tsd = t.copy()  # avoid mutating test data
...     ts = list(tsd)
...     before = bisect_right(ts, start_date)  # insertion point at > start_date
...     after = bisect_left(ts, end_date)      # insertion point is < end_date
...     for i in range(before):                # up to == start_date
...         del tsd[ts[i]]
...     for i in range(after + 1, len(ts)):    # from >= end_date onwards
...         del tsd[ts[i]]
...

Trial outcome:

>>> count, total = Timer("t.copy()", "from __main__ import test_data as t").autorange()
>>> baseline = total / count
>>> for test in (iteration, bisection):
...     timer = Timer("test(t, s, e)", "from __main__ import test, test_data as t, start_date as s, end_date as e")
...     count, total = timer.autorange()
...     print(f"{test.__name__:>10}: {((total / count) - baseline) * 1000000:6.2f} microseconds")
...
 iteration: 671.33 microseconds
 bisection:  80.92 microseconds

(The test subtracts the base-line cost of making a dict copy first).

However, there may well be more efficient data structures for these kind of operations. I checked out the sortedcontainers project as it includes a SortedDict() type that supports bisection on the keys directly. Unfortunately, while it performs better than your iteration approach, I can't make it perform better here than bisecting on a copy of the keys list:

>>> from sortedcontainers import SortedDict
>>> test_data_sorteddict = SortedDict(test_data)
>>> def sorteddict(t, start_date, end_date):
...     tsd = t.copy()
...     # SortedDict supports slicing on the key view
...     keys = tsd.keys()
...     del keys[:tsd.bisect_right(start_date)]
...     del keys[tsd.bisect_left(end_date) + 1:]
...
>>> count, total = Timer("t.copy()", "from __main__ import test_data_sorteddict as t").autorange()
>>> baseline = total / count
>>> timer = Timer("test(t, s, e)", "from __main__ import sorteddict as test, test_data_sorteddict as t, start_date as s, end_date as e")
>>> count, total = timer.autorange()
>>> print(f"sorteddict: {((total / count) - baseline) * 1000000:6.2f} microseconds")
sorteddict: 249.46 microseconds

I may be using the project wrong, however. Deleting keys from SortedDict objects is O(NlogN) so I suspect that that's where this falls down. Creating a new SortedDict() object from the other 9950 key-value pairs is slower still (over 2 milliseconds, not something you want to compare against the other approaches).

However, if you were to use the SortedDict.irange() method you can simply ignore values, not delete them, and iterate over a sub-set of dictionary keys:

for ts in timeseries(start_date, end_date, inclusive=(False, False)):
    # iterates over all start_date > timestamp > end_date keys, in order.

eliminating the need to delete anything. The irange() implementation uses bisection under the hood.

import time

import timeit

print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t =datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i ==100:
        start_time = t
    if i == 900:
        end_time = t
        """,
stmt="""
tmp = time_series_dict.copy()
for k in list(tmp.keys()):
    if not start_time <= k <= end_time:
        del tmp[k]

""",
number=10000
))
print(timeit.timeit(setup="""import datetime
time_series_dict = {}
for i in range(10000):
    t =datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i ==100:
        start_time = t
    if i == 900:
        end_time = t
""",
stmt="""
tmp = time_series_dict.copy()
result = {}
for k in list(tmp.keys()):
    if start_time <= k <= end_time:
        result[k] = tmp[k]
""",
number=10000
))
print(timeit.timeit(setup="""
import datetime
from bisect import bisect_left, bisect_right

time_series_dict = {}
for i in range(10000):
    t =datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S:%f')
    time_series_dict[t] = i
    if i ==100:
        start_time = t
    if i == 900:
        end_time = t

""",
stmt="""
tmp = time_series_dict.copy()
def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_right(ts, start_date)  # insertion point at > start_date
    after = bisect_left(ts, end_date)      # insertion point is < end_date
    for i in range(before):                # up to == start_date
        del tsd[ts[i]]
    for i in range(after + 1, len(ts)):    # from >= end_date onwards
        del tsd[ts[i]]

trim_time_series_dict(tmp, start_time, end_time)
""",
number=10000
))

test result

12.558672609
9.662761111
7.990544049

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM