
Optimizing efficiency of loops with numeric comparison using list and dictionary in Python

I have a list of integers: candidates = [1, 2, 3, 4, 5, 16, 20]. This list can contain more than 1 million items.

I have a dictionary number_ranges whose keys are integers and whose values are lists of objects, each holding the minimum ("start") and maximum ("end") of a range. This dictionary currently has about 500k keys.

{
    {5: [{"start": 0, "end": 9}]},
    {16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
}

I am now looping through the list:

for candidate in candidates:
    number = search_in_range(candidate, number_ranges)

where I check whether the candidate falls within any of the ranges in number_ranges, and if so, I return the key, which is used further on.

def search_in_range(candidate, number_ranges):
    for number_range_key in number_ranges:
        for number in number_ranges[number_range_key]:
            if int(number['start']) <= candidate <= int(number['end']):
                return {"key": number_range_key, "candidate": candidate}

When I run this, it takes about 40 seconds to process 1000 numbers from the list. This means that with 1 million numbers, I need more than 11 hours of processing.

('2018-12-19 16:22:47', 'Read', 1000)
('2018-12-19 16:23:30', 'Read', 2000)
('2018-12-19 16:24:10', 'Read', 3000)
('2018-12-19 16:24:46', 'Read', 4000)
('2018-12-19 16:25:26', 'Read', 5000)
('2018-12-19 16:25:59', 'Read', 6000)
('2018-12-19 16:26:39', 'Read', 7000)
('2018-12-19 16:27:28', 'Read', 8000)
('2018-12-19 16:28:15', 'Read', 9000)
('2018-12-19 16:28:57', 'Read', 10000)
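As a rough check of that estimate, the first and last timestamps in the log can be turned into a per-item rate. This is only a sketch: the variable names are illustrative, and it assumes the rate stays constant for the whole list.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# First and last log entries shown above.
first = datetime.strptime("2018-12-19 16:22:47", FMT)
last = datetime.strptime("2018-12-19 16:28:57", FMT)

# Items processed between those two entries (from 1000 read to 10000 read).
processed = 10000 - 1000

elapsed_seconds = (last - first).total_seconds()
seconds_per_item = elapsed_seconds / processed

# Extrapolate to 1 million candidates, assuming a constant rate.
hours_for_million = seconds_per_item * 1_000_000 / 3600
print(f"{elapsed_seconds:.0f} s for {processed} items, "
      f"about {hours_for_million:.1f} hours for 1 million")
```

The measured rate is roughly 41 seconds per 1000 items, which is consistent with the figure of more than 11 hours.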

The expected output is the key from number_ranges whose range contains the candidate, together with the candidate number used to find that key, i.e. return {"key": number_range_key, "candidate": candidate} in the function search_in_range.

What are the recommended ways in Python to optimize this algorithm?

Your list of candidates is sorted, so do the opposite: loop over the dictionaries in number_ranges and use bisect to binary-search the matching candidates. This reduces the complexity from O(n*m) to roughly O(n*(logm + k)) for n dictionaries, m candidates, and k matching candidates per range on average: two binary searches find the boundaries, then only the matches are visited.

(Note: I changed the format of your number_ranges from a set of dicts with just a single element each to a single dict, which makes much more sense.)

candidates = [1, 2, 3, 4, 5, 16, 20]
number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]
}

import bisect

for key, values in number_ranges.items():
    for value in values:
        start, end = value["start"], value["end"]
        lower = bisect.bisect_left(candidates, start)
        upper = bisect.bisect_right(candidates, end)
        for cand in range(lower, upper):
            res = {"key": key, "candidate": candidates[cand]}
            print(value, res)

Output:

{'start': 0, 'end': 9} {'key': 5, 'candidate': 1}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 2}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 3}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 4}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 5}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 16}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 20}
{'start': 16, 'end': 18} {'key': 16, 'candidate': 16}

If the candidates are not actually sorted, or if you want the results sorted by candidate instead of by dictionary, you can sort as a pre- or post-processing step.
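A minimal sketch of that variant, wrapping the bisect loop in a function that sorts the candidates first and orders the results by candidate afterwards. The function name match_candidates is made up for illustration; the ranges are treated as inclusive, as in the question.

```python
import bisect


def match_candidates(candidates, number_ranges):
    """Return {"key", "candidate"} dicts for every candidate inside a range."""
    ordered = sorted(candidates)  # pre-processing: bisect needs sorted input
    results = []
    for key, ranges in number_ranges.items():
        for r in ranges:
            # Slice of candidates with start <= candidate <= end.
            lower = bisect.bisect_left(ordered, r["start"])
            upper = bisect.bisect_right(ordered, r["end"])
            for cand in ordered[lower:upper]:
                results.append({"key": key, "candidate": cand})
    # Post-processing: order by candidate instead of by dictionary key.
    results.sort(key=lambda res: res["candidate"])
    return results


number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}],
}
results = match_candidates([20, 3, 16, 1], number_ranges)
for res in results:
    print(res)
```

A candidate that falls into several ranges of the same key (16 here) appears once per matching range; deduplicate on (key, candidate) if that is not wanted.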

With a little bit of reorganisation, your code becomes a classic interval tree problem.

Have a look at this package: https://pypi.org/project/intervaltree/

The only divergence from a normal interval tree is that some of your items cover multiple intervals. However, it would be easy enough to break them into individual intervals, e.g. {16.1: {"start": 15, "end": 20}, 16.2: {"start": 16, "end": 18}}
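One way to do that split, sketched with plain tuples rather than the 16.1 / 16.2 keys: each range becomes its own (start, end, key) entry that still carries the original key. The name flat_intervals is illustrative only.

```python
number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}],
}

# One (start, end, key) tuple per range; a key with several ranges
# simply appears several times.
flat_intervals = [
    (r["start"], r["end"], key)
    for key, ranges in number_ranges.items()
    for r in ranges
]
print(flat_intervals)
```

Keeping the original key in the tuple avoids inventing fractional keys and makes it trivial to map a match back to number_ranges.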

Using the intervaltree package builds a balanced binary search tree, which is much more efficient than nested for loops. Searching for each candidate is O(logn), whereas a for loop is O(n). With 1MM+ candidates, the intervaltree package is going to be considerably faster than the accepted nested for loop answer.
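To illustrate the idea without depending on the package, here is a hand-rolled centered interval tree using only the standard library. It is not the intervaltree API. The class and function names are made up for this sketch, intervals are assumed to be inclusive (start, end, key) tuples, and no attempt is made to tune it for 500k keys.

```python
class Node:
    """One node of a centered interval tree over (start, end, key) tuples."""

    def __init__(self, intervals):
        # Use the median endpoint as the center so both sides shrink.
        points = sorted(p for start, end, _ in intervals for p in (start, end))
        self.center = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        overlap = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        # Intervals containing the center, sorted both ways for fast scans.
        self.by_start = sorted(overlap, key=lambda iv: iv[0])
        self.by_end = sorted(overlap, key=lambda iv: iv[1], reverse=True)
        self.left = Node(left) if left else None
        self.right = Node(right) if right else None


def query(node, point):
    """Return every interval that contains point (inclusive bounds)."""
    found = []
    while node is not None:
        if point < node.center:
            for iv in node.by_start:      # ends are >= center > point
                if iv[0] > point:
                    break
                found.append(iv)
            node = node.left
        elif point > node.center:
            for iv in node.by_end:        # starts are <= center < point
                if iv[1] < point:
                    break
                found.append(iv)
            node = node.right
        else:
            found.extend(node.by_start)   # all of these contain the center
            break
    return found


tree = Node([(0, 9, 5), (15, 20, 16), (16, 18, 16)])
for candidate in [1, 10, 16, 20]:
    keys = sorted({key for _, _, key in query(tree, candidate)})
    print(candidate, keys)
```

Each query walks a single root-to-leaf path and only touches intervals that actually match (plus one extra per node), which is where the advantage over scanning every range comes from.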

Even though this question has an accepted answer, I would add for the sake of others that this type of scenario really justifies creating a reverse lookup. It is a one-time cost that saves a lot of time in practice as the candidate list grows longer. Dictionary lookups are O(1), and if you need to perform multiple lookups, you should consider creating a reverse mapping as well.

number_ranges = [
    {5: [{"start": 0, "end": 9}]},
    {16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
]

from collections import defaultdict

reversed_number_ranges = defaultdict(set) #returns an empty set, avoiding key errors.


for number in number_ranges:
    for k,v in number.items(): 
        ranges = set() #create a set of values which fall within range
        for range_dict in v:
            ranges.update(range(range_dict["start"], range_dict["end"] + 1)) #assuming "end" is included. remove the +1 for right exclusive.
        for i in ranges:
            reversed_number_ranges[i].add(k) #add the key for each location in a range.


candidates = [1, 2 ,3, 4 , 5, 16, 20]

for candidate in candidates:
    print(candidate, reversed_number_ranges[candidate])

Output:

1 {5}
2 {5}
3 {5}
4 {5}
5 {5}
16 {16}
20 {16}
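One caveat when reading from a defaultdict like the one above: indexing it with a candidate that matches no range silently inserts an empty set for that candidate, so a long candidate list can grow the mapping. A small sketch of the difference, using .get for read-only lookups (the variable names here are illustrative):

```python
from collections import defaultdict

# Tiny reverse mapping: every value in 0..9 points back to key 5.
reverse = defaultdict(set)
for value in range(0, 10):
    reverse[value].add(5)

size_before = len(reverse)

# .get never inserts, so misses leave the mapping untouched.
hit = reverse.get(3, set())
miss = reverse.get(42, set())
size_after_get = len(reverse)

# Plain indexing on a miss adds an empty set under the missing key.
_ = reverse[42]
size_after_index = len(reverse)

print(hit, miss, size_before, size_after_get, size_after_index)
```

Converting the finished mapping with dict(reverse) before the lookup phase is another way to get the same protection, at the cost of a KeyError on misses unless .get is used.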
