How to optimize this loop in Python?
I have to compare two lists of dicts such as below:
main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 2, 'rate': 39, 'type': 'A'}, ...]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}, ...]

for m in main:
    for c in compare:
        if (m['rate'] > c['rate']) and (m['type'] == c['type']):
            # ...
The lists have around 9,000 items. The body of the inner loop above runs around 81,000,000 times (9,000 * 9,000). How can I speed this up?
You could first sort or split the lists by type and perform the comparisons per type only. The question then is how many operations you need for sorting or splitting and how many for the comparison. Remember that there are quite efficient sort algorithms.
The next optimization could be sorting by rate. That way you can break out of the loop as soon as the condition m['rate'] > c['rate'] is no longer satisfied. In fact, you can even use a divide-and-conquer algorithm.
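As a rough sketch of the two steps above (grouping by type, then sorting by rate so the inner loop can stop early), something like the following could work. The helper name matching_pairs and the sample data are made up for illustration; they do not come from the question.

```python
from collections import defaultdict


def matching_pairs(main, compare):
    """Yield (m, c) pairs with equal type and m['rate'] > c['rate']."""
    # Group the compare entries by type in one linear pass.
    by_type = defaultdict(list)
    for c in compare:
        by_type[c['type']].append(c)
    # Sort each group by rate so the inner loop can stop early.
    for group in by_type.values():
        group.sort(key=lambda c: c['rate'])

    for m in main:
        for c in by_type.get(m['type'], []):
            if m['rate'] <= c['rate']:
                break  # no later c in this sorted group can have a lower rate
            yield m, c


# Hypothetical sample data, for illustration only.
main = [{'id': 1, 'rate': 13, 'type': 'C'},
        {'id': 2, 'rate': 39, 'type': 'A'},
        {'id': 3, 'rate': 95, 'type': 'A'}]
compare = [{'id': 119, 'rate': 33, 'type': 'D'},
           {'id': 120, 'rate': 94, 'type': 'A'},
           {'id': 121, 'rate': 20, 'type': 'A'}]

pairs = [(m['id'], c['id']) for m, c in matching_pairs(main, compare)]
print(pairs)
```

The inner loop still visits every matching pair, so the saving comes from skipping the non-matching ones; how much that helps depends on the data.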
Last but not least, you might benefit from the effect described in "Why is processing a sorted array faster than processing an unsorted array?", which is not an algorithmic improvement but can still make a huge difference.
Let me generate a dataset with 9000 items (in the future, you may want to include such a thing in your question, since it makes our life easier):
import random

types = ["A", "B", "C", "D", "E", "F"]
main = []
compare = []
for i in range(9000):
    main.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})
    compare.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})
Running this with a loop like
import time

start = time.time()
cycles = 0
for m in main:
    for c in compare:
        cycles += 1
        if (m['rate'] > c['rate']) and (m['type'] == c['type']):
            pass
end = time.time()
print("Total number of cycles " + str(cycles))
print("Seconds taken: " + str(end - start))
results (on my machine) in 81M cycles and ~30 seconds.
Splitting by type might look like this:
# Split by types
mainsplit = {}
compsplit = {}
for t in types:
    cycles += 1
    mainsplit[t] = []
    compsplit[t] = []
for m in main:
    cycles += 1
    mainsplit[m["type"]].append(m)
for c in compare:
    cycles += 1
    compsplit[c["type"]].append(c)

# Then go through it by type
for t in types:
    for m in mainsplit[t]:
        for c in compsplit[t]:
            cycles += 1
            if m['rate'] > c['rate']:
                pass
This gives ~14M cycles and only ~4 s.
Sorting the partial results by "rate" and finding a lower limit for "rate":
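# Then go through it by type
for t in types:
    mainsplit[t].sort(key=lambda i: i["rate"])
    compsplit[t].sort(key=lambda i: i["rate"])
    start_of_m_in_c = 0
    for m in mainsplit[t]:
        for nc in range(start_of_m_in_c, len(compsplit[t])):
            cycles += 1
            if m["rate"] > compsplit[t][nc]["rate"]:
                pass
            else:
                # First entry that m does not beat: stop here and let the
                # next (larger) m resume the scan from this index.
                start_of_m_in_c = nc
                break
        else:
            # m beats every remaining entry, so nothing is left to scan.
            start_of_m_in_c = len(compsplit[t])
        # compsplit[t][:start_of_m_in_c] now holds exactly the matches for m.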
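Cycles is now 36000 (not counting the cycles used by the sort algorithm) and the time is down to 30 ms. Note that this loop no longer visits every matching pair, so the count is not directly comparable with the previous versions; if each match has to be processed individually, the work is still proportional to the number of matches.
All in all, that's a performance increase by a factor of 1000.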
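If only the number of matching pairs is needed (an assumption; the question does not say what happens inside the if), a boundary index per entry is enough and the matches never have to be visited one by one. A minimal sketch, with the helper name count_matches and the small random data set chosen purely for illustration:

```python
import random
from bisect import bisect_left
from collections import defaultdict


def count_matches(main, compare):
    """Count pairs with equal type and m['rate'] > c['rate']."""
    # Collect the compare rates per type and sort each list once.
    rates = defaultdict(list)
    for c in compare:
        rates[c['type']].append(c['rate'])
    for r in rates.values():
        r.sort()

    total = 0
    for m in main:
        # bisect_left returns how many sorted rates are strictly below m['rate'].
        total += bisect_left(rates.get(m['type'], []), m['rate'])
    return total


# Small random data set (illustrative only) to check against the nested loops.
random.seed(0)
types = ["A", "B", "C"]
main = [{'rate': random.randint(0, 50), 'type': random.choice(types)} for _ in range(200)]
compare = [{'rate': random.randint(0, 50), 'type': random.choice(types)} for _ in range(200)]

brute = sum(1 for m in main for c in compare
            if m['rate'] > c['rate'] and m['type'] == c['type'])
print(count_matches(main, compare) == brute)
```

Sorting dominates here, so the count costs O(n log n) regardless of how many pairs match.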
Given:
main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'}
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'}
]
You can first map the two lists of dicts into two dicts of lists of dicts indexed by type, and sort the sub-lists by rate:
mappings = []
for lst in main, compare:
    mappings.append({})
    for entry in lst:
        mappings[-1].setdefault(entry['type'], []).append(entry)
    for entries in mappings[-1].values():
        entries.sort(key=lambda entry: entry['rate'])
main, compare = mappings
so that main becomes:
{'C': [{'id': 1, 'rate': 13, 'type': 'C'}],
'A': [{'id': 2, 'rate': 39, 'type': 'A'},
{'id': 3, 'rate': 94, 'type': 'A'},
{'id': 4, 'rate': 95, 'type': 'A'},
{'id': 5, 'rate': 96, 'type': 'A'}]}
while compare becomes:
{'D': [{'id': 119, 'rate': 33, 'type': 'D'}],
'A': [{'id': 120, 'rate': 94, 'type': 'A'}]}
This lets you iterate through the types that the two dicts have in common and, for each entry in compare, use bisect to find the index in the matching sub-list of main from which the rate is greater than that of the compare entry, which takes O(log n) time, and then iterate through the rest of the sub-list from that index for processing. Overall, the grouping, sorting and searching take O(n log n) time, plus time proportional to the number of matches you process, an improvement over the O(n ^ 2) time complexity of your original code:
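from bisect import bisect

for t in main.keys() & compare.keys():
    main_entries = main[t]
    # Build the sorted list of rates once per type, not once per compare entry.
    rates = [d['rate'] for d in main_entries]
    for entry in compare[t]:
        # bisect returns the first index whose rate is greater than entry['rate'].
        for match in main_entries[bisect(rates, entry['rate']):]:
            print(match['id'], entry['id'])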
This outputs:
4 120
5 120
Demo: https://repl.it/repls/EasygoingReadyTechnologies
Disclaimer: This may look like an implementation of @ThomasWeller's solution, but I did not see his answer until I had finished my code, which was interrupted by other work. Also, @ThomasWeller suggests sorting the two lists by type, which would incur O(n log n) time complexity, when the grouping can be done in linear time, as shown in the for entry in lst loop in my code.
This looks like a job for sqlite - it's the kind of thing databases are totally optimized for. Python has very nice bindings to sqlite, so it should fit nicely.
Here's a starting point...
import sqlite3

c = None
try:
    c = sqlite3.connect(':memory:')
    c.execute('create table main ( id integer primary key, rate integer not null, type text not null );')
    main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 2, 'rate': 39, 'type': 'A'}]
    for e in main:
        # Use placeholders instead of building the SQL string by hand.
        c.execute('insert into main (id, rate, type) values (?, ?, ?)',
                  (e['id'], e['rate'], e['type']))
    # now for the query
    # exercise left for the OP (but does require some SQL expertise)
except sqlite3.Error as e:
    print(e)
finally:
    if c:
        c.close()
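One possible shape for the query, as a sketch only: it assumes a second table named compare with the same columns as main, and it reuses the small sample data from the previous answer. The index name compare_type_rate is made up, and whether the index helps at this size would need measuring.

```python
import sqlite3

main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'},
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'},
]

conn = sqlite3.connect(':memory:')
try:
    # Assumed schema: both tables share the columns of the original "main" table.
    for table in ('main', 'compare'):
        conn.execute(f'create table {table} (id integer primary key, '
                     'rate integer not null, type text not null)')
    # Named placeholders let the dicts be inserted as they are.
    conn.executemany('insert into main (id, rate, type) values (:id, :rate, :type)', main)
    conn.executemany('insert into compare (id, rate, type) values (:id, :rate, :type)', compare)
    # Hypothetical index so the join can look up rows by type and rate.
    conn.execute('create index compare_type_rate on compare (type, rate)')

    # Pair up rows of the same type where the main rate is higher.
    rows = conn.execute(
        'select m.id, c.id from main as m '
        'join compare as c on m.type = c.type and m.rate > c.rate '
        'order by m.id, c.id').fetchall()
    print(rows)
finally:
    conn.close()
```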