
Optimizing a for-loop in Python

I have code that reads its data and stores it in a dictionary. The for loop currently takes about 2 hours for roughly 200,000 records, and now I am wondering what will happen when I have 2 million records.

Here is my for-loop example (sorry for the naming of the variables, this is just my sample code):

# Get the data from the database
data_list = self.my_service.get_database_list()

my_dict_list = {}

for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age

    my_dict_list[primary_key] = value

This is my model/DB get code:

def get_database_list(self):
    # Query the key plus the three columns the loop needs
    return self.session.query(
        self.mapper.primarykey,
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
        )

My database engine is InnoDB. Is there a way to optimize this a bit, or to loop through the data faster? Thank you for sharing.

First, I doubt your bottleneck (several hours) lies in the Python part. You can get some improvement with generators and dict comprehensions, but by how much? Look at a sample of 200,000 rows:

import base64
import os
def random_ascii_string(srt_len):
    return base64.urlsafe_b64encode(os.urandom(3*srt_len))[0:srt_len]

>>> data = [{'id': x, 'name': random_ascii_string(10), 'age': '%s' % x,
...          'address': random_ascii_string(20)} for x in xrange(2*10**5)]

Your approach

>>> timeit.timeit("""
... from __main__ import data
... my_dict_list = {}
... for item in data:
...     my_dict_list[item['id']] = item['name'] + item['address'] + item['age']""",
...         number = 100)
16.727806467023015

Dict comprehension

>>> timeit.timeit("from __main__ import data; "
...    "my_dict_list = { d['id']: d['name']+d['address']+d['age'] for d in data}",
...     number = 100)
14.474646358685249

I doubt you can find two hours in those optimisations. So your first task is to find your bottleneck. I advise you to have a look at the MySQL part of your job, and probably redesign it to:

  • use a separate InnoDB file per table
  • use indexes if you retrieve only a smaller part of the data
  • do some of the evaluation on the DB side, such as the name + address + age concatenation (see the sketch after this list)
  • do not process the whole data set; retrieve only the part you need (e.g. the first few rows)
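
For example, the concatenation can be pushed into the query itself so that MySQL returns ready-made (primarykey, value) pairs. This is only a sketch, assuming SQLAlchemy with string columns and the mapper/session names from the question:

from sqlalchemy import func

def get_database_list(self):
    # Let the database build name + address + age and return just two columns
    return self.session.query(
        self.mapper.primarykey,
        func.concat(self.mapper.name,
                    self.mapper.address,
                    self.mapper.age).label('value'),
        )

# Building the dict then becomes a plain comprehension over (key, value) rows
my_dict_list = {row.primarykey: row.value for row in self.my_service.get_database_list()}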

It's hard to just guess where your code spends the most time. The best thing to do is to run it under cProfile and examine the results:

python -m cProfile -o prof <your_script> <args...>

This outputs a file named prof, which you can examine in various ways, the coolest of which is using runsnakerun.
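
If you just want a quick look without installing anything extra, the standard-library pstats module can read the same file (a minimal sketch; 'prof' is the output file name used above):

import pstats

# Load the profile written by cProfile and print the 10 most expensive calls
stats = pstats.Stats('prof')
stats.sort_stats('cumulative').print_stats(10)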

Other than that, off the top of my head, a dict comprehension is often faster than the alternatives:

my_dict_list = { item.primarykey: item.name + item.address + item.age for item in data_list }

Also, it is not exactly clear what item.name + item.address + item.age does (are they all strings?), but if you can consider changing your data structure and storing item instead of that combined value, it might help further.
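
A sketch of that idea (the attribute names come from the question, and some_key is just a placeholder): store the row objects themselves and build the combined string only when you actually need it:

# Keep the whole row; no string concatenation up front
my_dict_list = {item.primarykey: item for item in data_list}

# Build the combined value lazily, only for the keys you actually use
row = my_dict_list[some_key]   # some_key: whatever key you look up
value = row.name + row.address + row.age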

Agreed with the above comments on iterators. You could try using a dictionary comprehension in place of the loop:

import uuid
import time

class mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

data_list = [mock() for x in range(2000000)]

# Time the explicit for-loop
my_dict_list = {}
t1 = time.time()
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
print(time.time() - t1)

# Time the equivalent dict comprehension
t2 = time.time()
new_dict = { item.primarykey: item.name + item.address + item.age for item in data_list }
print(time.time() - t2)
