I have code that fetches data from a database and builds a dictionary from it. The for loop over about 200,000 rows currently takes around 2 hours, and I am worried about what will happen when I have 2 million rows.
Here is my for loop (sorry for the variable names, this is just sample code):
# Gets the data from the database
data_list = self.my_service.get_database_list()
my_dict_list = {}
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
This is my model/db get code:
def get_database_list(self):
    return self.session.query(
        self.mapper.primarykey,
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
    )
My database engine is InnoDB. Is there a way to optimize this, or to loop through the data faster? Thank you for sharing.
First, I doubt your bottleneck (several hours) lies in the Python part. You can get some improvement with generators and dict comprehensions, but by how much? Here is a sample with 200,000 rows:
import base64
import os
def random_ascii_string(str_len):
    return base64.urlsafe_b64encode(os.urandom(3 * str_len))[:str_len].decode('ascii')
>>> data = [{'id': x, 'name': random_ascii_string(10), 'age': '%s' % x,
...          'address': random_ascii_string(20)} for x in range(2 * 10**5)]
Your approach
>>> timeit.timeit("""
... from __main__ import data
... my_dict_list = {}
... for item in data:
... my_dict_list[item['id']] = item['name'] + item['address'] + item['age']""",
... number = 100)
16.727806467023015
Dict comprehension
>>> timeit.timeit("from __main__ import data; "
... "my_dict_list = { d['id']: d['name']+d['address']+d['age'] for d in data}",
... number = 100)
14.474646358685249
I doubt you can find two hours in those optimisations. So your first task is to find the real bottleneck. I advise you to have a look at the MySQL part of your job, and probably redesign it so the database itself computes:
name + address + age
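To illustrate pushing the concatenation into SQL, here is a minimal sketch using the stdlib sqlite3 module as a stand-in for MySQL; the table and column names are hypothetical. In MySQL you would write CONCAT(name, address, age) instead of the || operator.

```python
import sqlite3

# In-memory database standing in for the real MySQL/InnoDB table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, address TEXT, age TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [(1, "foo", "addr1", "20"), (2, "bar", "addr2", "30")],
)

# The database does the string concatenation; Python only builds the dict
# from (key, value) pairs, with no per-row work in the loop.
rows = conn.execute("SELECT id, name || address || age FROM person")
my_dict = dict(rows)
```

Since the cursor yields (id, concatenated_value) tuples, `dict(rows)` builds the mapping directly without an explicit loop body.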
It's hard to just guess where your code spends the most time. The best thing to do is to run it under cProfile and examine the results.
python -m cProfile -o prof <your_script> <args...>
This outputs a file named prof, which you can examine in various ways, the coolest of which is using runsnakerun.
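If you prefer to stay in the standard library, the same dump can be read with the pstats module. A minimal sketch, where build_dict is a stand-in for your real loop:

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for the real dictionary-building loop.
def build_dict(data):
    return {pk: name + addr + age for pk, name, addr, age in data}

data = [(i, "n", "a", str(i)) for i in range(100000)]

# Profile just the interesting call instead of the whole script.
profiler = cProfile.Profile()
profiler.enable()
build_dict(data)
profiler.disable()

# Render the top 5 entries by cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
# stream.getvalue() now holds the report, showing where the time went.
```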
Other than that, off the top of my head, a dict comprehension is often faster than the alternatives:
my_dict_list = { item.primarykey: item.name + item.address + item.age for item in data_list }
Also, it is not exactly clear what item.name + item.address + item.age does (are they all strings?), but if you can consider changing your data structure, and storing item instead of that combined value, it might help further.
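A sketch of that idea, with hypothetical field names: keep the whole record keyed by primary key, and defer the concatenation until a value is actually needed.

```python
from collections import namedtuple

# Hypothetical record type standing in for the ORM row object.
Person = namedtuple("Person", ["primarykey", "name", "address", "age"])

data_list = [Person(1, "foo", "addr", "20"), Person(2, "bar", "addr2", "30")]

# Store the item itself rather than the pre-built string.
by_key = {item.primarykey: item for item in data_list}

# Build the combined value lazily, only for the keys you actually touch.
def combined(item):
    return item.name + item.address + item.age
```

This avoids 200,000 string concatenations up front if only a fraction of the keys are ever looked up.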
Agreed with the above comments on iterators. You could try using a dict comprehension in place of the loop:
import uuid
import time

class Mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

data_list = [Mock() for x in range(2000000)]

my_dict_list = {}
t1 = time.time()
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
print(time.time() - t1)

my_dict_list = {}
t2 = time.time()
new_dict = {item.primarykey: item.name + item.address + item.age for item in data_list}
print(time.time() - t2)
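One further micro-optimisation worth timing (not from the answers above, so treat it as a sketch): operator.attrgetter from the standard library fetches all the attributes in one call, trimming a little of the per-item attribute-lookup overhead inside the comprehension.

```python
import operator
import time
import uuid

# Same mock record as in the benchmark above.
class Mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

data_list = [Mock() for _ in range(200000)]

# attrgetter returns (primarykey, name, address, age) for each item
# in a single call instead of four separate attribute lookups.
get_fields = operator.attrgetter("primarykey", "name", "address", "age")

t = time.time()
new_dict = {pk: name + address + age
            for pk, name, address, age in map(get_fields, data_list)}
print(time.time() - t)
```

As always, measure it against the plain comprehension on your own data before adopting it; the difference is small compared to fixing the database-side bottleneck.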