
Optimizing a for-loop in Python

I have code that reads data from a database and puts it into a dictionary. My for loop over about 200,000 records currently takes about two hours, and I am wondering what will happen once I have 2 million records.

Here is my for-loop example (sorry for the variable naming, this is just sample code):

# Get the rows from the database
data_list = self.my_service.get_database_list()

my_dict_list = {}

for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age

    my_dict_list[primary_key] = value

This is my model/db get code:

def get_database_list(self):
    return self.session.query(
        self.mapper.primarykey,  # needed for the dictionary key in the loop above
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
        )

My database engine is InnoDB. Is there a way to optimize this a bit, or to loop through the data faster? Thank you for sharing.

First, I doubt your bottleneck (several hours) lies in the Python part. You can get some improvement with generators and dict comprehensions, but by how much? Here is a sample with 200,000 rows:

import base64
import os

def random_ascii_string(str_len):
    return base64.urlsafe_b64encode(os.urandom(3 * str_len))[0:str_len]

>>> data = [{'id': x, 'name': random_ascii_string(10), 'age': '%s' % x,
...          'address': random_ascii_string(20)} for x in range(2 * 10**5)]

Your approach

>>> timeit.timeit("""
... from __main__ import data
... my_dict_list = {}
... for item in data:
...     my_dict_list[item['id']] = item['name'] + item['address'] + item['age']""",
...         number = 100)
16.727806467023015

Dict comprehension

>>> timeit.timeit("from __main__ import data; "
...    "my_dict_list = { d['id']: d['name']+d['address']+d['age'] for d in data}",
...     number = 100)
14.474646358685249

I doubt you can recover two hours through those optimisations, so your first task is to find the bottleneck. I advise you to have a look at the MySQL part of the job, and probably redesign it to:

  • use a separate InnoDB file per table
  • use indexes if you retrieve only a smaller part of the data
  • do some evaluation on the DB side, such as the name + address + age concatenation (see the sketch after this list)
  • do not process the whole data set; retrieve only the part you need (e.g. the first few rows)
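
To illustrate the last two points, here is a minimal sketch, assuming an SQLAlchemy setup like the one in the question (the mapper and primarykey names are taken from it); func.concat renders as MySQL's CONCAT(), and limit() caps the result set:

from sqlalchemy import func

def get_database_list(self):
    # Concatenate on the MySQL side and cap the row count, so Python
    # receives ready-made (key, value) pairs instead of raw columns.
    return self.session.query(
        self.mapper.primarykey,
        func.concat(self.mapper.name, self.mapper.address, self.mapper.age),
    ).limit(1000)  # adjust to the part of the data you actually need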

It's hard to guess where your code spends the most time just by looking at it. The best thing to do is to run it under cProfile and examine the results.

python -m cProfile -o prof <your_script> <args...>

This outputs a file named prof, which you can examine in various ways, the coolest of which is runsnakerun.
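
For a quick look without extra tools, the standard-library pstats module can read the same file (a minimal sketch; prof is the file name used above):

import pstats

# Load the dump written by `python -m cProfile -o prof ...` and print
# the ten entries with the highest cumulative time.
stats = pstats.Stats('prof')
stats.sort_stats('cumulative').print_stats(10)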

Other than that, off the top of my head, a dict comprehension is often faster than the alternatives:

my_dict_list = {item.primarykey: item.name + item.address + item.age for item in data_list}

Also, it is not entirely clear what item.name + item.address + item.age does (are they all strings?), but if you can change your data structure and store item instead of that combined value, it might help further.
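
A minimal sketch of that idea, with a hypothetical combined() helper just to show the lazy lookup:

# Store the row object itself; build the combined string only when
# a given key is actually needed.
my_dict = {item.primarykey: item for item in data_list}

def combined(key):
    item = my_dict[key]
    return item.name + item.address + item.age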

Agreed with the above comments on iterators. You could try using a dictionary comprehension in place of the loop:

import time
import uuid

# Stand-in for a database row with the attributes used in the question.
class Mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

data_list = [Mock() for x in range(2000000)]

# Time the explicit for loop.
my_dict_list = {}
t1 = time.time()
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
print(time.time() - t1)

# Time the equivalent dict comprehension.
t2 = time.time()
new_dict = {item.primarykey: item.name + item.address + item.age for item in data_list}
print(time.time() - t2)
