I have a list of integers: candidates = [1, 2, 3, 4, 5, 16, 20]. This list can contain more than 1 million items.
I have a dictionary number_ranges whose keys are integers and whose values are lists of objects, each holding the minimum and maximum of a range. This dictionary currently has about 500k keys.
{
{5: [{"start": 0, "end": 9}]},
{16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
}
I am now looping through the list:
for candidate in candidates:
    number = search_in_range(candidate, number_ranges)
where I check whether a number from candidates falls within one of the ranges in number_ranges and, if so, return the key, which is used further on.
def search_in_range(candidate, number_ranges):
    for number_range_key in number_ranges:
        for number in number_ranges[number_range_key]:
            if int(number['start']) <= candidate <= int(number['end']):
                return {"key": number_range_key, "candidate": candidate}
When I run this, I see that it takes about 40 seconds to process 1000 numbers from the list. This means that if I have 1 million numbers, I need more than 11 hours to process.
('2018-12-19 16:22:47', 'Read', 1000)
('2018-12-19 16:23:30', 'Read', 2000)
('2018-12-19 16:24:10', 'Read', 3000)
('2018-12-19 16:24:46', 'Read', 4000)
('2018-12-19 16:25:26', 'Read', 5000)
('2018-12-19 16:25:59', 'Read', 6000)
('2018-12-19 16:26:39', 'Read', 7000)
('2018-12-19 16:27:28', 'Read', 8000)
('2018-12-19 16:28:15', 'Read', 9000)
('2018-12-19 16:28:57', 'Read', 10000)
The expected output is the keys from number_ranges whose ranges match, together with the candidate number used to find each key, i.e. return {"key": number_range_key, "candidate": candidate} in function search_in_range.
What are the recommended ways in Python to optimize this algorithm?
Your list of candidates is sorted, so do the opposite: loop over the dictionaries in number_ranges and use bisect to binary-search the matching candidates. This reduces the complexity from O(n*m) to O(n*(log m + k)) for n range dictionaries, m candidates, and k matching candidates per range on average.
(Note: I changed the format of your number_ranges from a set of dict with just a single element each to just a dict, which makes much more sense.)
import bisect

candidates = [1, 2, 3, 4, 5, 16, 20]
number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]
}
for key, values in number_ranges.items():
    for value in values:
        start, end = value["start"], value["end"]
        # indices of the first and one-past-the-last candidate inside [start, end]
        lower = bisect.bisect_left(candidates, start)
        upper = bisect.bisect_right(candidates, end)
        for cand in range(lower, upper):
            res = {"key": key, "candidate": candidates[cand]}
            print(value, res)
Output:
{'start': 0, 'end': 9} {'key': 5, 'candidate': 1}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 2}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 3}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 4}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 5}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 16}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 20}
{'start': 16, 'end': 18} {'key': 16, 'candidate': 16}
If the candidates are not sorted in reality, or if you want the results to be sorted by candidate instead of by dictionary, you can just sort either as a pre- or post-processing step.
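As a rough sketch of that pre- and post-processing, the loop can be wrapped in a function that sorts the candidates before bisecting and sorts the results by candidate afterwards. The function name match_candidates is made up for this illustration, and number_ranges is assumed to be the plain dict form used in this answer.

```python
import bisect


def match_candidates(candidates, number_ranges):
    """Return a {"key", "candidate"} dict for every candidate inside a range.

    Hypothetical helper: assumes number_ranges is a plain dict mapping a key
    to a list of {"start", "end"} dicts with inclusive bounds.
    """
    ordered = sorted(candidates)  # pre-processing: bisect needs sorted input
    results = []
    for key, ranges in number_ranges.items():
        for r in ranges:
            lower = bisect.bisect_left(ordered, r["start"])
            upper = bisect.bisect_right(ordered, r["end"])
            results.extend({"key": key, "candidate": c} for c in ordered[lower:upper])
    # post-processing: order the matches by candidate instead of by dictionary
    results.sort(key=lambda res: res["candidate"])
    return results


number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}],
}
matches = match_candidates([20, 3, 16, 1, 5, 4, 2], number_ranges)
for match in matches:
    print(match)
```

Note that candidate 16 appears twice in the result because it falls inside both ranges stored under key 16; deduplicate afterwards if only distinct keys per candidate are needed.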
With a little bit of reorganisation, your code becomes a classic interval tree problem.
Have a look at this package: https://pypi.org/project/intervaltree/
The only divergence from a normal interval tree is that you have some items that cover multiple intervals; however, it would be easy enough to break them into individual intervals, e.g. {16.1: {"start": 15, "end": 20}, 16.2: {"start": 16, "end": 18}}
By using the intervaltree package, a balanced binary search tree is created, which is much more efficient than using nested for loops. Searching for each candidate is roughly O(log n) plus the number of matching intervals, whereas a for loop is O(n). If there are more than 1 million candidates, the intervaltree package is going to be considerably faster than the accepted nested for loop answer.
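To make the idea concrete without depending on the package, here is a minimal hand-rolled centered interval tree in plain Python. It is only a sketch of the general technique: it does not use or mirror the intervaltree package's API, the names build_tree, query_point and keys_for are invented for this example, and it assumes the plain dict form of number_ranges with inclusive bounds. Items with several ranges are first broken into individual (start, end, key) intervals, as suggested above.

```python
def build_tree(intervals):
    """Build a centered interval tree from (start, end, key) tuples."""
    if not intervals:
        return None
    # The median endpoint is used as the center to keep the tree roughly balanced.
    points = sorted(p for start, end, _ in intervals for p in (start, end))
    center = points[len(points) // 2]
    here = [iv for iv in intervals if iv[0] <= center <= iv[1]]
    return {
        "center": center,
        "by_start": sorted(here, key=lambda iv: iv[0]),
        "by_end": sorted(here, key=lambda iv: iv[1], reverse=True),
        "left": build_tree([iv for iv in intervals if iv[1] < center]),
        "right": build_tree([iv for iv in intervals if iv[0] > center]),
    }


def query_point(node, point):
    """Return every stored interval that contains point."""
    found = []
    while node is not None:
        if point < node["center"]:
            # Intervals at this node end at or after the center, so only the start matters.
            for iv in node["by_start"]:
                if iv[0] > point:
                    break
                found.append(iv)
            node = node["left"]
        elif point > node["center"]:
            # Intervals at this node start at or before the center, so only the end matters.
            for iv in node["by_end"]:
                if iv[1] < point:
                    break
                found.append(iv)
            node = node["right"]
        else:
            found.extend(node["by_start"])
            break
    return found


number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}],
}
# Break multi-range items into individual intervals that remember their key.
intervals = [(r["start"], r["end"], key) for key, ranges in number_ranges.items() for r in ranges]
tree = build_tree(intervals)


def keys_for(candidate):
    """Distinct keys whose ranges contain the candidate."""
    return {key for _, _, key in query_point(tree, candidate)}


for candidate in [1, 2, 3, 4, 5, 16, 20]:
    print(candidate, keys_for(candidate))
```

The tree is built once from the ranges, and each candidate then walks a single root-to-leaf path, which is where the logarithmic lookup cost comes from.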
Even though this question has an accepted answer, I would add for the sake of others that this type of scenario really justifies creating a reverse lookup. It is a one-time headache that will save a lot of practical time as the candidate list grows longer. Dictionary lookups are O(1), so if you need to perform multiple lookups, you should consider creating a reverse mapping as well.
from collections import defaultdict

number_ranges = [
    {5: [{"start": 0, "end": 9}]},
    {16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
]
reversed_number_ranges = defaultdict(set)  # returns an empty set, avoiding key errors
for number in number_ranges:
    for k, v in number.items():
        ranges = set()  # all values that fall within any range of this key
        for range_dict in v:
            # assuming "end" is included; remove the + 1 for right-exclusive ranges
            ranges.update(range(range_dict["start"], range_dict["end"] + 1))
        for i in ranges:
            reversed_number_ranges[i].add(k)  # add the key for each location in a range
candidates = [1, 2, 3, 4, 5, 16, 20]
for candidate in candidates:
    print(candidate, reversed_number_ranges[candidate])
Output:
1 {5}
2 {5}
3 {5}
4 {5}
5 {5}
16 {16}
20 {16}
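For completeness, a small sketch of how the reverse mapping could be packaged to return the {"key": ..., "candidate": ...} records the question asks for. The function names build_reverse_lookup and search_all are made up for this example, and it assumes number_ranges as a plain dict rather than the list of dicts above. Keep in mind that the mapping stores one entry per integer covered by any range, so memory grows with the total width of the ranges.

```python
from collections import defaultdict


def build_reverse_lookup(number_ranges):
    """Map every integer covered by a range to the set of keys covering it."""
    reverse = defaultdict(set)
    for key, ranges in number_ranges.items():
        for r in ranges:
            # "end" is treated as inclusive, hence the + 1
            for value in range(r["start"], r["end"] + 1):
                reverse[value].add(key)
    return reverse


def search_all(candidates, reverse):
    """Return one {"key", "candidate"} record per matching key."""
    results = []
    for candidate in candidates:
        # .get avoids inserting empty sets for candidates that match nothing
        for key in sorted(reverse.get(candidate, ())):
            results.append({"key": key, "candidate": candidate})
    return results


number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}],
}
reverse = build_reverse_lookup(number_ranges)
for record in search_all([1, 16, 12, 20], reverse):
    print(record)
```

Building the mapping is the one-time cost; after that, each candidate is a single dictionary lookup.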