How to optimize this loop in Python?

Question

I have to compare two lists of dicts such as the ones below:

main = [{'id': 1, 'rate': 13, 'type': 'C'}, {'id': 2, 'rate': 39, 'type': 'A'}, ...]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}, ...]

for m in main:
    for c in compare:
        if (m['rate'] > c['rate']) and (m['type'] == c['type']):
            # ...

The lists have around 9,000 items each. The above code runs around 81,000,000 times (9,000 * 9,000). How can I speed this up?

Answer 1

You could first sort or split the lists by type and perform the comparisons per type only. The question then is: how many operations do you need for sorting or splitting, and how many for the comparison? Remember that there are quite efficient sort algorithms.

The next optimization could be sorting by rate. That way you can break out of the loop as soon as the condition m['rate'] > c['rate'] is no longer satisfied. In fact, you can even use a divide and conquer algorithm.
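
A minimal sketch of the divide and conquer idea for a single type, assuming the compare rates of that type are already sorted. Python's standard bisect module does the binary search; the helper name count_lower and the sample rates are made up for illustration.

```python
from bisect import bisect_left

def count_lower(sorted_rates, rate):
    """Return how many values in sorted_rates are strictly lower than rate."""
    # bisect_left returns the first index whose value is >= rate,
    # so everything before that index is strictly lower.
    return bisect_left(sorted_rates, rate)

compare_rates = sorted([33, 94, 12, 57])  # hypothetical rates of one type
print(count_lower(compare_rates, 57))     # number of compare entries below 57
```

Each lookup takes O(log n) instead of a full scan, which is where the saving over the nested loop comes from.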

Last but not least, you might benefit from "Why is processing a sorted array faster than processing an unsorted array?", which is not an algorithmic improvement, but can still make a huge difference.

Let me generate a dataset with 9000 items (in the future, you may want to include such a thing in your question, since it makes our lives easier):

import random
types = ["A", "B", "C", "D", "E", "F"]
main=[]
compare = []
for i in range(9000):
    main.append({'id':random.randint(0,20000), 'rate':random.random()*500, 'type':types[random.randint(0,5)]})
    compare.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})

Running this with a loop like

import time
start = time.time()
cycles = 0
for m in main:
    for c in compare:
        cycles += 1
        if (m['rate'] > c['rate']) and (m['type'] == c['type']):
            pass
end = time.time()
print("Total number of cycles "+str(cycles))
print("Seconds taken: " + str(end - start))

results (on my machine) in 81M cycles and about 30 seconds.

Splitting by type might look like this:

# Split by types
mainsplit = {}
compsplit = {}
for t in types:
    cycles += 1
    mainsplit[t] = []
    compsplit[t] = []
for m in main:
    cycles += 1
    mainsplit[m["type"]].append(m)
for c in compare:
    cycles += 1
    compsplit[c["type"]].append(c)

# Then go through it by type
for t in types:
    for m in mainsplit[t]:
        for c in compsplit[t]:
            cycles += 1
            if m['rate'] > c['rate']:
                pass

This gives ~14M cycles and takes only ~4 s.
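
The split above relies on the list of types being known up front. As a sketch of an alternative, assuming the types are not known in advance, collections.defaultdict can build the groups in one pass. The helper name split_by_type and the sample data are only illustrative.

```python
from collections import defaultdict

def split_by_type(entries):
    """Group a list of dicts into {type: [entries of that type]}."""
    groups = defaultdict(list)
    for entry in entries:
        # A missing key is created as an empty list on first access
        groups[entry["type"]].append(entry)
    return groups

sample = [
    {"id": 1, "rate": 13, "type": "C"},
    {"id": 2, "rate": 39, "type": "A"},
    {"id": 3, "rate": 94, "type": "A"},
]
groups = split_by_type(sample)
print(sorted(groups))                    # the types that occur in the data
print([e["id"] for e in groups["A"]])    # ids of the entries with type A
```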

Sorting the partial results by "rate" and finding a lower limit for "rate":

# Then go through it by type
for t in types:
    mainsplit[t].sort(key=lambda i: i["rate"])
    compsplit[t].sort(key=lambda i: i["rate"])
    start_of_m_in_c = 0
    for m in mainsplit[t]:
        for nc in range(start_of_m_in_c, len(compsplit[t])):
            cycles += 1
            if m["rate"] > compsplit[t][nc]["rate"]:
                pass  # match; all entries before start_of_m_in_c match as well
            else:
                start_of_m_in_c = nc
                break  # sorted by rate, so no later entry can match

The cycle count is now 36000 (not counting the cycles used by the sort algorithm) and the time drops to 30 ms.

All in all, that's a performance increase by a factor of 1000.

Answer 2

Given:

main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'}
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'}
]

You can first map the two lists of dicts into two dicts of lists of dicts indexed by type, and sort the sub-lists by rate:

mappings = []
for lst in main, compare:
    mappings.append({})
    for entry in lst:
        mappings[-1].setdefault(entry['type'], []).append(entry)
    for entries in mappings[-1].values():
        entries.sort(key=lambda entry: entry['rate'])
main, compare = mappings

so that main becomes:

{'C': [{'id': 1, 'rate': 13, 'type': 'C'}],
 'A': [{'id': 2, 'rate': 39, 'type': 'A'},
       {'id': 3, 'rate': 94, 'type': 'A'},
       {'id': 4, 'rate': 95, 'type': 'A'},
       {'id': 5, 'rate': 96, 'type': 'A'}]}

while compare becomes:

{'D': [{'id': 119, 'rate': 33, 'type': 'D'}],
 'A': [{'id': 120, 'rate': 94, 'type': 'A'}]}

You can then iterate through the matching types of the two dicts in linear time, and use bisect to find the index in each sub-list of main from which the rate is greater than that of the compare entry, which takes O(log n) time, and then iterate through the rest of the sub-list from that index for processing. Overall this algorithm has a time complexity of O(n log n) (plus the time spent on the matching pairs themselves), an improvement over the O(n ^ 2) time complexity of your original code:

from bisect import bisect

for type_ in main.keys() & compare.keys():
    main_entries = main[type_]
    # build the sorted list of rates once per type, not once per compare entry
    rates = [d['rate'] for d in main_entries]
    for entry in compare[type_]:
        for match in main_entries[bisect(rates, entry['rate']):]:
            print(match['id'], entry['id'])

This outputs:

4 120
5 120

Demo: https://repl.it/repls/EasygoingReadyTechnologies

Disclaimer: This may look like an implementation of @ThomasWeller's solution, but I did not actually see his answer until I had finished my coding, which was interrupted by my other work. Also, @ThomasWeller wants to sort the two lists by type, which would incur an O(n log n) time complexity, when it can be done in linear time as shown in the for entry in lst loop in my code.
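
For reuse, the same idea can be wrapped in a generator that leaves the input lists untouched. This is only a sketch under the assumptions of the question (each entry has 'rate' and 'type' keys); the function name matching_pairs is made up here.

```python
from bisect import bisect_right
from collections import defaultdict

def matching_pairs(main, compare):
    """Yield (m, c) with m['type'] == c['type'] and m['rate'] > c['rate']."""
    by_type = defaultdict(list)
    for m in main:
        by_type[m["type"]].append(m)
    # Sort each group once and keep its rates in a parallel list for bisect
    rates = {}
    for t, group in by_type.items():
        group.sort(key=lambda e: e["rate"])
        rates[t] = [e["rate"] for e in group]
    for c in compare:
        group = by_type.get(c["type"])
        if not group:
            continue  # no main entry has this type
        # First index whose rate is strictly greater than c's rate
        start = bisect_right(rates[c["type"]], c["rate"])
        for m in group[start:]:
            yield m, c

main = [
    {"id": 1, "rate": 13, "type": "C"},
    {"id": 2, "rate": 39, "type": "A"},
    {"id": 3, "rate": 94, "type": "A"},
    {"id": 4, "rate": 95, "type": "A"},
    {"id": 5, "rate": 96, "type": "A"},
]
compare = [
    {"id": 119, "rate": 33, "type": "D"},
    {"id": 120, "rate": 94, "type": "A"},
]
print([(m["id"], c["id"]) for m, c in matching_pairs(main, compare)])
```

Note that the number of yielded pairs can itself be quadratic, so the gain is largest when only a small share of the pairs match or when only a count is needed.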

Answer 3

This looks like a job for sqlite - it's the kind of thing databases are totally optimized for. Python has very nice bindings to sqlite, so it should fit nicely.

Here's a starting point...

import sqlite3

c = None
try:
    c = sqlite3.connect(':memory:')
    c.execute('create table main ( id integer primary key, rate integer not null,   type text not null );')
    main = [{'id': 1,'rate': 13,'type': 'C'}, {'id': 2,'rate': 39,'type': 'A'}]
    for e in main:
        # use placeholders rather than building the SQL string by hand
        c.execute('insert into main (id, rate, type) values (?, ?, ?)',
                  (e['id'], e['rate'], e['type']))
    # now for the query
    # exercise left for the OP (but does require some SQL expertise)
except sqlite3.Error as e:
    print(e)
finally:
    if c:
        c.close()
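
One possible way to fill in the query, as a sketch rather than a definitive solution: load both lists into in-memory tables and let a join do the comparison. The table names main_items and compare_items are chosen here for illustration, and the sample rows are taken from the other answers.

```python
import sqlite3

main = [
    {"id": 1, "rate": 13, "type": "C"},
    {"id": 2, "rate": 39, "type": "A"},
    {"id": 3, "rate": 94, "type": "A"},
    {"id": 4, "rate": 95, "type": "A"},
    {"id": 5, "rate": 96, "type": "A"},
]
compare = [
    {"id": 119, "rate": 33, "type": "D"},
    {"id": 120, "rate": 94, "type": "A"},
]

conn = sqlite3.connect(":memory:")
try:
    for table, rows in (("main_items", main), ("compare_items", compare)):
        conn.execute(
            f"create table {table} "
            "(id integer primary key, rate integer not null, type text not null)"
        )
        # Named placeholders are filled from the keys of each dict
        conn.executemany(
            f"insert into {table} (id, rate, type) values (:id, :rate, :type)",
            rows,
        )
    # Join on equal type, keeping rows where the main rate is higher
    pairs = conn.execute(
        "select m.id, c.id from main_items as m "
        "join compare_items as c on m.type = c.type and m.rate > c.rate "
        "order by m.id, c.id"
    ).fetchall()
finally:
    conn.close()

print(pairs)
```

For the full 9,000-row lists, an index on (type, rate) would probably help the join, but that is untested here.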

Answer 4

You can use the PyPy interpreter instead of classic CPython. It can give you about an 80% speedup.

4 answers

Answer 1
3 2019-09-20 16:18:19

Answer 2
1 2019-09-20 18:13:53

Answer 3
0 2019-09-20 16:08:05

Answer 4
0 2019-09-20 16:20:56
