
Optimizing a for-loop in Python

I have code that reads data from a database and puts it into a dictionary. My for loop over about 200,000 records currently takes about two hours, and I am wondering what will happen once I have 2 million records.

Here is my for-loop example (sorry for the variable naming, this is just sample code):

# Get the rows from the database
data_list = self.my_service.get_database_list()

my_dict_list = {}

for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age

    my_dict_list[primary_key] = value

This is my model/db get code:

def get_database_list(self):
    return self.session.query(
        self.mapper.primarykey,  # needed for the dictionary key in the loop above
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
        )

My database engine is InnoDB. Is there a way to optimize this a bit, or to loop through the data faster? Thank you for sharing.

First, I doubt your bottleneck (several hours) lies in the Python part. You can get some improvement with generators and dict comprehensions, but by how much? Here is a sample with 200,000 rows:

import base64
import os

def random_ascii_string(str_len):
    return base64.urlsafe_b64encode(os.urandom(3 * str_len))[0:str_len]

>>> data = [{'id': x, 'name': random_ascii_string(10), 'age': '%s' % x,
...          'address': random_ascii_string(20)} for x in range(2 * 10**5)]

Your approach

>>> timeit.timeit("""
... from __main__ import data
... my_dict_list = {}
... for item in data:
...     my_dict_list[item['id']] = item['name'] + item['address'] + item['age']""",
...         number = 100)
16.727806467023015

Dict comprehension

>>> timeit.timeit("from __main__ import data; "
...    "my_dict_list = { d['id']: d['name']+d['address']+d['age'] for d in data}",
...     number = 100)
14.474646358685249

I doubt you can recover two hours through those optimisations, so your first task is to find the bottleneck. I advise you to have a look at the MySQL part of the job, and probably redesign it to:

  • use a separate InnoDB file per table
  • use indexes if you retrieve only a smaller part of the data
  • do some evaluation on the DB side, such as the name + address + age concatenation (see the sketch after this list)
  • do not process the whole data set; retrieve only the part you need (e.g. the first few rows)
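
To illustrate the last two points, here is a minimal sketch, assuming an SQLAlchemy setup like the one in the question (the mapper and primarykey names are taken from it); func.concat renders as MySQL's CONCAT(), and limit() caps the result set:

from sqlalchemy import func

def get_database_list(self):
    # Concatenate on the MySQL side and cap the row count, so Python
    # receives ready-made (key, value) pairs instead of raw columns.
    return self.session.query(
        self.mapper.primarykey,
        func.concat(self.mapper.name, self.mapper.address, self.mapper.age),
    ).limit(1000)  # adjust to the part of the data you actually need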

It's hard to guess where your code spends the most time just by looking at it. The best thing to do is to run it under cProfile and examine the results.

python -m cProfile -o prof <your_script> <args...>

This outputs a file named prof, which you can examine in various ways, the coolest of which is runsnakerun.
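
For a quick look without extra tools, the standard-library pstats module can read the same file (a minimal sketch; prof is the file name used above):

import pstats

# Load the dump written by `python -m cProfile -o prof ...` and print
# the ten entries with the highest cumulative time.
stats = pstats.Stats('prof')
stats.sort_stats('cumulative').print_stats(10)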

Other than that, off the top of my head, a dict comprehension is often faster than the alternatives:

my_dict_list = {item.primarykey: item.name + item.address + item.age for item in data_list}

Also, it is not entirely clear what item.name + item.address + item.age does (are they all strings?), but if you can change your data structure and store item instead of that combined value, it might help further.
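
A minimal sketch of that idea, with a hypothetical combined() helper just to show the lazy lookup:

# Store the row object itself; build the combined string only when
# a given key is actually needed.
my_dict = {item.primarykey: item for item in data_list}

def combined(key):
    item = my_dict[key]
    return item.name + item.address + item.age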

Agreed with the above comments on iterators. You could try using a dictionary comprehension in place of the loop:

import time
import uuid

# Stand-in for a database row with the attributes used in the question.
class Mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

data_list = [Mock() for x in range(2000000)]

# Time the explicit for loop.
my_dict_list = {}
t1 = time.time()
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
print(time.time() - t1)

# Time the equivalent dict comprehension.
t2 = time.time()
new_dict = {item.primarykey: item.name + item.address + item.age for item in data_list}
print(time.time() - t2)
