
Python - Locating the closest timestamp

I have a Python datetime timestamp and a large dict (index) where the keys are timestamps and the values are some other information I'm interested in.

I need to find the datetime (the key) in index that is closest to timestamp, as efficiently as possible.

At the moment I'm doing something like:

for timestamp in timestamps:
    closestTimestamp = min(index,key=lambda datetime : abs(timestamp - datetime))

This works, but it takes too long: my index dict has millions of entries, and I'm doing the search thousands of times. I'm flexible about data structures. The timestamps are roughly sequential, so I'm iterating from the first to the last timestamp. Likewise, the timestamps in the text file that I load into the dict are sequential.

Any ideas for optimisation would be greatly appreciated.

Dictionaries aren't organized for efficient near-miss searches. They are designed for exact matches (using a hash table).

You may be better off maintaining a separate, fast-searchable ordered structure.

A simple way to start is to use the bisect module, which gives fast O(log N) searches at the cost of slower O(N) insertions:

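from bisect import bisect_left

s = sorted(index)  # presorted list of timestamps, built once

def nearest(ts):
    # Find the insertion point, then compare only its immediate neighbours.
    i = bisect_left(s, ts)
    return min(s[max(0, i-1): i+2], key=lambda t: abs(ts - t))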
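As a rough sketch of how this could replace the loop in the question, the keys can be sorted once and then bisected for every query timestamp. The helper name closest_keys and the sample data are made up for illustration, and the sketch assumes index is non-empty.

```python
from bisect import bisect_left
from datetime import datetime

def closest_keys(index, timestamps):
    """For each query timestamp, return the closest key of index.

    Hypothetical helper: sorts the keys once, then does one O(log N)
    bisect per query instead of an O(N) scan. Assumes index is non-empty.
    """
    s = sorted(index)
    result = []
    for ts in timestamps:
        i = bisect_left(s, ts)
        # Only the key just before and the key at the insertion point can win.
        candidates = s[max(0, i - 1): i + 1]
        result.append(min(candidates, key=lambda t: abs(ts - t)))
    return result

# Made-up sample data standing in for the real index.
sample_index = {
    datetime(2012, 1, 1, 10, 0): "a",
    datetime(2012, 1, 1, 10, 5): "b",
    datetime(2012, 1, 1, 10, 30): "c",
}
queries = [datetime(2012, 1, 1, 10, 2), datetime(2012, 1, 1, 10, 20)]
closest = closest_keys(sample_index, queries)
values = [sample_index[k] for k in closest]
```

The value for each match is then an ordinary exact-match dict lookup, which is what the dict is good at.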

A more sophisticated approach, suitable for non-static, dynamically updated dicts, would be to use blist, which employs a tree structure for fast O(log N) insertions and lookups. You only need this if the dict is going to change over time.

If you want to stay with a dictionary-based approach, consider a dict-of-lists that clusters entries with nearby timestamps:

def get_closest_stamp(ts):
    'Speed-up timestamp search by looking only at entries in the same hour'
    hour = round_to_nearest_hour(ts)
    cluster = daydict[hour]         # a list of entries clustered under that hour
    return min(cluster, key=lambda t: abs(ts - t))

Note: for exact results near cluster boundaries, store close-to-the-boundary timestamps in both the primary cluster and the adjacent cluster.
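One possible way to build such a clustered dict is sketched below. The margin parameter, the build_daydict helper, and this implementation of round_to_nearest_hour are assumptions made for illustration, not part of the answer above. Results are only exact when the true nearest timestamp lies in the query's own cluster or within margin of its boundary.

```python
from collections import defaultdict
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)

def round_to_nearest_hour(ts):
    """Round a datetime to the nearest whole hour (half hours round up)."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    return floored + HOUR if ts - floored >= HOUR / 2 else floored

def build_daydict(stamps, margin=timedelta(minutes=10)):
    """Cluster stamps by nearest hour, duplicating those near a boundary.

    margin is an assumed tuning knob: a stamp within margin of a
    cluster boundary is also stored in the neighbouring cluster.
    """
    daydict = defaultdict(list)
    for t in stamps:
        hour = round_to_nearest_hour(t)
        daydict[hour].append(t)
        # The cluster for `hour` covers [hour - 30 min, hour + 30 min).
        if t - (hour - HOUR / 2) <= margin:
            daydict[hour - HOUR].append(t)
        elif (hour + HOUR / 2) - t <= margin:
            daydict[hour + HOUR].append(t)
    return daydict

def get_closest_stamp(daydict, ts):
    """Return the closest stamp in ts's cluster, or None if it is empty."""
    cluster = daydict.get(round_to_nearest_hour(ts), [])
    return min(cluster, key=lambda t: abs(ts - t)) if cluster else None

# Made-up sample: 10:31 belongs to the 11:00 cluster but is copied into 10:00.
daydict = build_daydict([datetime(2012, 1, 1, 10, 5), datetime(2012, 1, 1, 10, 31)])
hit = get_closest_stamp(daydict, datetime(2012, 1, 1, 10, 29))
```

Without the duplication, the query at 10:29 would only see 10:05 in its own cluster and miss the much closer 10:31.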

datetime objects are comparable to each other, so make a sorted list of your key/value pairs like this:

# index is the dict from the question; items() replaces the Python 2-only iteritems()
myPairs = sorted(index.items())

For each element myPairs[i], myPairs[i][0] is the datetime key and myPairs[i][1] is the value.

You can search this list efficiently using bisect_left:

import bisect
# Search with a 1-tuple so that only the datetime keys are compared, never the values.
i = bisect.bisect_left(myPairs, (targetDatetime,))

The element myPairs[i] is the one with the earliest datetime that is no earlier than targetDatetime. But the prior element (if there is one) might be closer in time to targetDatetime. Or targetDatetime might be later than every time in myPairs, in which case i equals len(myPairs). So you need to check:

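if i > 0 and i == len(myPairs):
    # targetDatetime is later than every key: fall back to the last element
    i -= 1
elif i > 0 and targetDatetime - myPairs[i-1][0] < myPairs[i][0] - targetDatetime:
    # the previous element is strictly closer than the one at i
    i -= 1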
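Put together, the steps above might look like the following sketch. The function name closest_pair and the sample data are hypothetical, and the list is assumed to be non-empty and sorted by datetime.

```python
import bisect
from datetime import datetime

def closest_pair(pairs, target):
    """Return the (datetime, value) pair whose datetime is closest to target.

    pairs must be a non-empty list of (datetime, value) tuples sorted by datetime.
    """
    # A 1-tuple sorts before any 2-tuple with the same first element, so
    # only the datetimes are ever compared, never the values.
    i = bisect.bisect_left(pairs, (target,))
    if i == len(pairs):
        return pairs[-1]          # target is later than every key
    if i > 0 and target - pairs[i - 1][0] < pairs[i][0] - target:
        return pairs[i - 1]       # previous element is strictly closer
    return pairs[i]

# Made-up sample data; the values are dicts, which cannot be ordered.
sample = {
    datetime(2012, 1, 1, 10, 0): {"n": 1},
    datetime(2012, 1, 1, 10, 10): {"n": 2},
}
pairs = sorted(sample.items())
found = closest_pair(pairs, datetime(2012, 1, 1, 10, 4))
```

Searching for a 1-tuple rather than the bare datetime matters when the values are not orderable, as with the dicts in this sample.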

If your list is truly sorted and not just "roughly sequential", you can use a binary search. Have a look at the bisect module documentation for more information.
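Since bisect silently gives wrong answers on unsorted input, it may be worth checking the order once before searching. A minimal sketch, with the helper names is_sorted and prepare invented for illustration:

```python
from datetime import datetime

def is_sorted(stamps):
    """Return True if stamps is in non-decreasing order."""
    return all(a <= b for a, b in zip(stamps, stamps[1:]))

def prepare(stamps):
    """Return a list that is safe to bisect, sorting only if needed."""
    return stamps if is_sorted(stamps) else sorted(stamps)

# Made-up "roughly sequential" input with one out-of-order entry.
rough = [
    datetime(2012, 1, 1, 10, 0),
    datetime(2012, 1, 1, 10, 2),
    datetime(2012, 1, 1, 10, 1),
]
ready = prepare(rough)
```

The check is a single O(N) pass, which is cheap next to thousands of lookups.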
