How to optimize this loop in Python?

I have to compare two lists of dicts such as the ones below:

main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 2, 'rate': 39, 'type': 'A'}, ...]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}, ...]

for m in main:
  for c in compare:
     if (m['rate'] > c['rate']) and (m['type'] == c['type']):
          # ...

The lists have around 9,000 items each. The body of the inner loop runs around 81,000,000 times (9,000 * 9,000). How can I speed this up?

You could first sort or split the lists by type and perform the comparisons per type only. The question then is how many operations you need for sorting or splitting and how many for the comparison. Remember that there are quite efficient sort algorithms.

The next optimization could be sorting by rate. That way you can break out of the loop as soon as the condition m['rate'] > c['rate'] is no longer satisfied. In fact, you can even use a divide and conquer algorithm.
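As a minimal sketch of the early-break idea for a single type, assuming the compare rates have already been sorted in ascending order (the sample values and the helper name lower_rates are made up for illustration):

```python
# Hypothetical sample rates for one type; not taken from the question.
main_rates = [13, 39, 94]
compare_rates = sorted([94, 33, 20, 50])  # ascending: [20, 33, 50, 94]

def lower_rates(m_rate, sorted_rates):
    """Collect the rates below m_rate, stopping at the first one that is not."""
    found = []
    for c_rate in sorted_rates:
        if m_rate > c_rate:
            found.append(c_rate)
        else:
            break  # input is sorted, so no later rate can be lower
    return found

matches = {m: lower_rates(m, compare_rates) for m in main_rates}
print(matches)
```

Because the list is sorted, the loop never looks past the first rate that fails the condition.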

Last but not least, you might benefit from the effect described in "Why is processing a sorted array faster than processing an unsorted array?", which is not an algorithmic improvement but can still make a huge difference.

Let me generate a dataset with 9000 items (in the future, you may want to include such a thing in your question, since it makes our life easier):

import random
types = ["A", "B", "C", "D", "E", "F"]
main=[]
compare = []
for i in range(9000):
    main.append({'id':random.randint(0,20000), 'rate':random.random()*500, 'type':types[random.randint(0,5)]})
    compare.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})

Running this with a loop like

import time
start = time.time()
cycles = 0
for m in main:
  for c in compare:
      cycles += 1
      if (m['rate'] > c['rate']) and (m['type'] == c['type']):
          pass
end = time.time()
print("Total number of cycles "+str(cycles))
print("Seconds taken: " + str(end - start))

results (on my machine) in 81M cycles and takes ~30 seconds.

Splitting by type might look like this:

# Split by types
mainsplit = {}
compsplit = {}
for t in types:
    cycles += 1
    mainsplit[t] = []
    compsplit[t] = []
for m in main:
    cycles += 1
    mainsplit[m["type"]].append(m)
for c in compare:
    cycles += 1
    compsplit[c["type"]].append(c)

# Then go through it by type
for t in types:
    for m in mainsplit[t]:
        for c in compsplit[t]:
            cycles += 1
            if m['rate'] > c['rate']:
                pass

This gives ~14M cycles and takes only ~4 s.
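The same split can also be written without knowing the list of types in advance. The following is only a sketch, assuming every item has a 'type' key; the helper name split_by_type and the three sample items are illustrative:

```python
from collections import defaultdict

def split_by_type(items):
    """Group a list of dicts into {type: [dict, ...]} in a single pass."""
    groups = defaultdict(list)
    for item in items:
        groups[item['type']].append(item)
    return groups

# Small hypothetical sample in the same shape as the question's data
sample = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
]
groups = split_by_type(sample)
print({t: [d['id'] for d in entries] for t, entries in groups.items()})
```

The grouping takes one pass over the list, so its cost is linear in the number of items.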

Sorting the partial results by "rate" and finding a lower limit for "rate":

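# Then go through it by type
for t in types:
    mainsplit[t].sort(key=lambda i:i["rate"])
    compsplit[t].sort(key=lambda i:i["rate"])
    # Every entry of compsplit[t] before this index is already known
    # to have a lower rate than the current (and any later) m
    start_of_m_in_c = 0
    for m in mainsplit[t]:
        for nc in range(start_of_m_in_c, len(compsplit[t])):
            cycles += 1
            if m["rate"] > compsplit[t][nc]["rate"]:
                pass
            else:
                start_of_m_in_c = nc
                break  # sorted input: no later entry can have a lower rate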

The cycle count is now 36000 (not counting the cycles used by the sort algorithm) and the run time is down to 30 ms.

All in all, that's a performance increase by a factor of 1000.
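If only the number of matching pairs is needed, a binary search over the sorted rates can replace the inner loop entirely. This is a sketch of that variation, not the code that was timed above; the function name count_matches is hypothetical and the sample data is only there to make it runnable:

```python
from bisect import bisect_left
from collections import defaultdict

def count_matches(main, compare):
    """Count pairs (m, c) with the same type and m['rate'] > c['rate']."""
    rates_by_type = defaultdict(list)
    for c in compare:
        rates_by_type[c['type']].append(c['rate'])
    for rates in rates_by_type.values():
        rates.sort()
    total = 0
    for m in main:
        # bisect_left returns how many sorted rates are strictly below m['rate']
        total += bisect_left(rates_by_type.get(m['type'], []), m['rate'])
    return total

main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'},
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'},
]
print(count_matches(main, compare))
```

Each lookup costs O(log n), so counting stays cheap even when the number of matching pairs is large.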

Given:

main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'}
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'}
]

You can first map the two lists of dicts into two dicts of lists of dicts indexed by type, and sort the sub-lists by rate:

mappings = []
for lst in main, compare:
    mappings.append({})
    for entry in lst:
        mappings[-1].setdefault(entry['type'], []).append(entry)
    for entries in mappings[-1].values():
        entries.sort(key=lambda entry: entry['rate'])
main, compare = mappings

so that main becomes:

{'C': [{'id': 1, 'rate': 13, 'type': 'C'}],
 'A': [{'id': 2, 'rate': 39, 'type': 'A'},
       {'id': 3, 'rate': 94, 'type': 'A'},
       {'id': 4, 'rate': 95, 'type': 'A'},
       {'id': 5, 'rate': 96, 'type': 'A'}]}

while compare becomes:

{'D': [{'id': 119, 'rate': 33, 'type': 'D'}],
 'A': [{'id': 120, 'rate': 94, 'type': 'A'}]}

You can then iterate through the types the two dicts have in common in linear time, and use bisect to find the index in each sub-list of main from which the rate is greater than that of the compare entry, which takes O(log n) time per lookup. From that index on, iterate through the rest of the sub-list for processing. Not counting the time spent processing the matches themselves, this algorithm is O(n log n) overall, an improvement over the O(n ^ 2) time complexity of your original code:

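from bisect import bisect

for type_ in main.keys() & compare.keys():
    main_entries = main[type_]
    # Build the sorted list of rates once per type, not once per compare entry
    main_rates = [d['rate'] for d in main_entries]
    for entry in compare[type_]:
        # bisect returns the first index whose rate is strictly greater
        for match in main_entries[bisect(main_rates, entry['rate']):]:
            print(match['id'], entry['id'])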

This outputs:

4 120
5 120

Demo: https://repl.it/repls/EasygoingReadyTechnologies

Disclaimer: This may look like an implementation of @ThomasWeller's solution, but I actually did not see his answer until I had finished my coding, which was interrupted by my other work. Also, @ThomasWeller wants to sort the two lists by type, which would incur an O(n log n) time complexity, when the grouping can be done in linear time as shown in the for entry in lst loop in my code.

This looks like a job for sqlite: it's the kind of thing databases are totally optimized for. Python has very nice bindings to sqlite, so it should fit nicely.

Here's a starting point...

import sqlite3

c = None
try:
    c = sqlite3.connect(':memory:')
    c.execute('create table main ( id integer primary key, rate integer not null, type text not null );')
    main = [{'id': 1,'rate': 13,'type': 'C'}, {'id': 2,'rate': 39,'type': 'A'}]
    for e in main:
        # Bind the values as parameters instead of concatenating them into the SQL string
        c.execute('insert into main (id, rate, type) values (?, ?, ?)',
                  (e['id'], e['rate'], e['type']))
    # now for the query
    # exercise left for the OP (but does require some SQL expertise)
except sqlite3.Error as e:
    print(e)
finally:
    if c:
        c.close()
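One possible shape for the query, offered only as a sketch: load both lists into their own tables and let a join express the two conditions. The table names main_items and compare_items and the two-row sample lists are assumptions made for this example, not part of the answer above.

```python
import sqlite3

# Hypothetical two-row samples in the same shape as the question's data
main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 4, 'rate': 95, 'type': 'A'}]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}]

conn = sqlite3.connect(':memory:')
try:
    for table, rows in (('main_items', main), ('compare_items', compare)):
        conn.execute(f'create table {table} '
                     '(id integer primary key, rate integer not null, type text not null)')
        # Named placeholders are filled from the keys of each dict
        conn.executemany(f'insert into {table} (id, rate, type) '
                         'values (:id, :rate, :type)', rows)
    # Same condition as the nested loops: equal type and higher rate in main
    pairs = conn.execute(
        'select m.id, c.id from main_items m '
        'join compare_items c on m.type = c.type and m.rate > c.rate '
        'order by m.id, c.id'
    ).fetchall()
finally:
    conn.close()

print(pairs)
```

Whether this beats the pure-Python approaches depends on the data size and on whether an index on (type, rate) is added, so it is worth measuring.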

You can use the PyPy interpreter instead of the classic CPython. It can give you a speedup of about 80%.

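Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.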

 