简体   繁体   中英

Look up python dict value by expression

I have a dict that has unix epoch timestamps for keys, like so:

lookup_dict = {
    1357899: {} #some dict of data
    1357910: {} #some other dict of data
}

Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:

lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}

But I confess, I can't find any actual proof that this is something Python can do without having, one way or the other, to iterate over every row. If I understand Python correctly (and I might not), key lookup of the form key in dict uses binary search, and is thus very fast; any way to do a binary search, on dict keys?

To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value , instead of checking each one for >= lookup_value .

If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees , it's as simple as this:

dict_subset = lookup_dict[lookup_value:]

And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)

Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).

If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict . You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees , if you want.

Using the following code will work out

some_time_to_filter_for = # blah unix time
# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items() 
            if key >= some_time_to_filter_for}

Basically we just iterate through all the keys in your dictionary and given a time to filter out for we take all the keys that are greater than or equal to that value and place them into our new dictionary

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM