Optimizing a for-loop in Python
I have code that fetches data and stores it in a dictionary. The for loop over roughly 200,000 records currently takes about 2 hours, and now I am wondering what will happen when I have 2 million records.
Here is my for-loop example (sorry about the variable naming, this is just sample code):
# Get the data from the database
data_list = self.my_service.get_database_list()

my_dict_list = {}
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
This is my model/db get code:
def get_database_list(self):
    return self.session.query(
        self.mapper.primarykey,  # the loop reads item.primarykey, so it must be selected too
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
    )
My database engine is InnoDB. Is there a way to optimize this a bit, or to loop through the data faster? Thank you for sharing.
First, I doubt your bottleneck (several hours) lies in the Python part. You can get some improvement with generators and dict comprehensions, but by how much? Look at a sample with 200,000 rows:
import base64
import os

def random_ascii_string(str_len):
    # Random URL-safe ASCII string of the requested length
    return base64.urlsafe_b64encode(os.urandom(3 * str_len))[0:str_len]

>>> data = [{'id': x, 'name': random_ascii_string(10), 'age': '%s' % x,
...          'address': random_ascii_string(20)} for x in xrange(2 * 10**5)]
Your approach:
>>> timeit.timeit("""
... from __main__ import data
... my_dict_list = {}
... for item in data:
... my_dict_list[item['id']] = item['name'] + item['address'] + item['age']""",
... number = 100)
16.727806467023015
Dict comprehension:
>>> timeit.timeit("from __main__ import data; "
... "my_dict_list = { d['id']: d['name']+d['address']+d['age'] for d in data}",
... number = 100)
14.474646358685249
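For completeness, here is a minimal sketch of the generator variant mentioned above: dict() consumes a generator of (key, value) pairs directly, so no intermediate list is built. Expect it to land in the same ballpark as the comprehension.

>>> my_dict_list = dict(
...     (d['id'], d['name'] + d['address'] + d['age']) for d in data)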
I doubt you can find two hours in those optimisations, so your first task is to find your bottleneck. I advise you to have a look at the MySQL part of your job, and probably redesign it so that the name + address + age concatenation happens on the database side.
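A minimal sketch of that idea, assuming the SQLAlchemy session and mapper from the question: func.concat renders as MySQL's CONCAT(), so each row arrives already combined and the Python side collapses to a single dict() call.

from sqlalchemy import func

def get_database_list(self):
    # Each result row is (primarykey, name + address + age),
    # with the concatenation done by MySQL rather than Python
    return self.session.query(
        self.mapper.primarykey,
        func.concat(self.mapper.name, self.mapper.address, self.mapper.age),
    )

# my_dict_list = dict(self.get_database_list())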
It's hard to just guess where your code spends the most time. The best thing to do is to run it using cProfile and examine the results:

python -m cProfile -o prof <your_script> <args...>

This outputs a file named prof, which you can examine in various ways, the coolest of which is using runsnakerun.
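If you would rather stay in the terminal, a minimal sketch using the standard-library pstats module on the prof file produced above:

import pstats

stats = pstats.Stats('prof')
# Print the ten entries with the largest cumulative time
stats.sort_stats('cumulative').print_stats(10)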
Other than that, off the top of my head, a dict comprehension is often faster than the alternatives:

my_dict_list = {item.primarykey: item.name + item.address + item.age for item in data_list}
Also, it is not exactly clear what item.name + item.address + item.age does (are they all strings?), but if you can consider changing your data structure and storing item instead of that combined value, it might help further.
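A hedged sketch of that suggestion: key the dictionary by primary key, keep the whole row object, and build the combined string only for the rows you actually touch (some_key below is a placeholder, not a name from the question):

my_dict_list = {item.primarykey: item for item in data_list}

# Concatenate lazily, per lookup, instead of for all 200,000 rows up front
row = my_dict_list[some_key]  # some_key: hypothetical lookup key
combined = row.name + row.address + row.age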
Agreed with the above comments on iterators. You could try using a dictionary comprehension in place of the loop:
import uuid
import time

class mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

data_list = [mock() for x in range(2000000)]

# Explicit for loop
my_dict_list = {}
t1 = time.time()
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
print(time.time() - t1)

# Dictionary comprehension
my_dict_list = {}
t2 = time.time()
new_dict = {item.primarykey: item.name + item.address + item.age for item in data_list}
print(time.time() - t2)