
Optimizing a for-loop in Python

I have code that reads its data and stores it in a dictionary. The for loop currently takes about 2 hours for roughly 200,000 records, and now I am wondering what will happen when I have 2 million records.

Here is my for-loop example (sorry for the naming of the variables, this is just my sample code):

# Get the data from the database
data_list = self.my_service.get_database_list()

my_dict_list = {}

for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age

    my_dict_list[primary_key] = value

This is my model/DB get code:

def get_database_list(self):
    # Query the key plus the three columns the loop needs
    return self.session.query(
        self.mapper.primarykey,
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
        )

My database engine is InnoDB. Is there a way to optimize this a bit, or to loop through the data faster? Thank you for sharing.

First, I doubt your bottleneck (several hours) lies in the Python part. You can get some improvement with generators and dict comprehensions, but by how much? Look at a sample of 200,000 rows:

import base64
import os
def random_ascii_string(srt_len):
    return base64.urlsafe_b64encode(os.urandom(3*srt_len))[0:srt_len]

>>> data = [{'id': x, 'name': random_ascii_string(10), 'age': '%s' % x,
...          'address': random_ascii_string(20)} for x in xrange(2*10**5)]

Your approach

>>> timeit.timeit("""
... from __main__ import data
... my_dict_list = {}
... for item in data:
...     my_dict_list[item['id']] = item['name'] + item['address'] + item['age']""",
...         number = 100)
16.727806467023015

Dict comprehension

>>> timeit.timeit("from __main__ import data; "
...    "my_dict_list = { d['id']: d['name']+d['address']+d['age'] for d in data}",
...     number = 100)
14.474646358685249

I doubt you can find two hours in those optimisations. So your first task is to find your bottleneck. I advise you to have a look at the MySQL part of your job, and probably redesign it to:

  • use a separate InnoDB file per table
  • use indexes if you retrieve only a smaller part of the data
  • do some of the evaluation on the DB side, such as the name + address + age concatenation (see the sketch after this list)
  • do not process the whole data set; retrieve only the part you need (e.g. the first few rows)
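
For example, the concatenation can be pushed into the query itself so that MySQL returns ready-made (primarykey, value) pairs. This is only a sketch, assuming SQLAlchemy with string columns and the mapper/session names from the question:

from sqlalchemy import func

def get_database_list(self):
    # Let the database build name + address + age and return just two columns
    return self.session.query(
        self.mapper.primarykey,
        func.concat(self.mapper.name,
                    self.mapper.address,
                    self.mapper.age).label('value'),
        )

# Building the dict then becomes a plain comprehension over (key, value) rows
my_dict_list = {row.primarykey: row.value for row in self.my_service.get_database_list()}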

It's hard to just guess where your code spends the most time. The best thing to do is to run it under cProfile and examine the results:

python -m cProfile -o prof <your_script> <args...>

This outputs a file named prof, which you can examine in various ways, the coolest of which is using runsnakerun.
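
If you just want a quick look without installing anything extra, the standard-library pstats module can read the same file (a minimal sketch; 'prof' is the output file name used above):

import pstats

# Load the profile written by cProfile and print the 10 most expensive calls
stats = pstats.Stats('prof')
stats.sort_stats('cumulative').print_stats(10)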

Other than that, off the top of my head, a dict comprehension is often faster than the alternatives:

my_dict_list = { item.primarykey: item.name + item.address + item.age for item in data_list }

Also, it is not exactly clear what item.name + item.address + item.age does (are they all strings?), but if you can consider changing your data structure and storing item instead of that combined value, it might help further.
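
A sketch of that idea (the attribute names come from the question, and some_key is just a placeholder): store the row objects themselves and build the combined string only when you actually need it:

# Keep the whole row; no string concatenation up front
my_dict_list = {item.primarykey: item for item in data_list}

# Build the combined value lazily, only for the keys you actually use
row = my_dict_list[some_key]   # some_key: whatever key you look up
value = row.name + row.address + row.age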

Agreed with the above comments on iterators. You could try using a dictionary comprehension in place of the loop:

import uuid
import time

class mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

data_list = [mock() for x in range(2000000)]

# Time the explicit for-loop
my_dict_list = {}
t1 = time.time()
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
print(time.time() - t1)

# Time the equivalent dict comprehension
t2 = time.time()
new_dict = { item.primarykey: item.name + item.address + item.age for item in data_list }
print(time.time() - t2)
