
Reducing time of sqlite3 execute/fetchone in python

Context

I'm working with several files in a proprietary format that store the results of a power system solution. The data is formatted fairly simply, but each result file is ~50MB. There is an API provided to query the file format, but I need to do lots of queries, and the API is horrendously slow.

I wrote a program to compare several of these files to each other using the API, and left it running for a couple of hours to no avail. My next thought was to do a single pass over the file, store the data I need into a sqlite3 database, and then query that. That got me a result in 20 minutes. Much better. Restructured the data to avoid JOINs where possible: 12 minutes. Stored the .db file in a temporary local location instead of on the network: 8.5 minutes.

Further Improvement

The program is more or less tolerable at its current speed, but it will be running many, many times per day once it's complete. At the moment, 62% of the run time is spent on 721 calls of .execute/.fetchone.

      160787763 function calls (160787745 primitive calls) in 503.061 seconds
Ordered by: internal time
List reduced from 1507 to 20 due to restriction <20>
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   721  182.869    0.254  182.869    0.254 {method 'fetchone' of 'sqlite3.Cursor' objects}
   721  129.355    0.179  129.355    0.179 {method 'execute' of 'sqlite3.Cursor' objects}
 24822   45.734    0.002   47.600    0.002 {method 'executemany' of 'sqlite3.Connection' objects}

Since so much time is spent in this small section, I thought I would ask for any ideas to improve it before I move forward. I feel like I may be missing something simple a more experienced eye will catch. This particular part of the program is basically structured like this:

for i, db in enumerate(dbs):
    for key, vals in dict.items():
        # If this slot already has a value, no need to fetch a comparison value
        if vals[i] is None:
            solution_id, num = key

            # Only fetch a comparison value if the solution is valid for the current db
            if solution_id in db.valid_ids:
                db.cur.execute("""SELECT value FROM table WHERE solution = ? AND num = ?""",
                               (solution_id, num))
                row = db.cur.fetchone()
                # .fetchone() returns None when there is no matching row
                if row is not None:
                    vals[i] = row[0]

The dict structure is:

dict = {(solution_id, num): [db1_val, db2_val, db3_val, db4_val]}

Every entry has at least one db_val; the others are None. The purpose of the loop above is to fill every db_val slot that can be filled, so the values can be compared.
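One way to avoid per-key queries entirely is to pull every (solution, num, value) row for a db in a single SELECT and fill the dict with plain Python lookups. A minimal sketch, where "results" stands in for the real table name and the toy data is made up:

```python
import sqlite3

# Build a toy version of one results db ("results" is a stand-in table name).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (solution INTEGER, num INTEGER, value REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)",
                 [(1, 10, 0.5), (1, 11, 0.7), (2, 10, 0.9)])

val_map = {(1, 10): [None], (2, 10): [None], (3, 10): [None]}
valid_ids = {1, 2}

# One pass over the table builds an O(1) in-memory lookup,
# replacing one .execute/.fetchone round trip per key.
lookup = {(sol, num): value
          for sol, num, value in conn.execute("SELECT solution, num, value FROM results")}

for key, vals in val_map.items():
    if vals[0] is None and key[0] in valid_ids:
        vals[0] = lookup.get(key)

print(val_map[(1, 10)][0])  # 0.5
```

This trades memory for speed: the whole table is held in a Python dict, which may or may not be acceptable for a ~50MB file.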

The Question

I've read that sqlite3 SELECT statements can only be executed with .execute, which removes my ability to use .executemany (which saved me tons of time on INSERTs). I've also read in the Python docs that calling .execute directly on the connection object can be more efficient, but I can't do that since I need to fetch the data.
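For reference, `sqlite3.Connection.execute()` returns a `Cursor`, so the connection-level shortcut can still be combined with a fetch; it only saves creating a cursor by hand. A small sketch ("results" is a stand-in table name):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (solution INTEGER, num INTEGER, value REAL)")
conn.execute("INSERT INTO results VALUES (1, 10, 0.5)")

# Connection.execute() returns a Cursor, so fetchone() can be chained directly.
row = conn.execute("SELECT value FROM results WHERE solution = ? AND num = ?",
                   (1, 10)).fetchone()
print(row[0])  # 0.5
```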

Is there a better way to structure the loop, or the query, to minimize the amount of time spent on .execute and .fetchone statements?

The Answer

Based on the answers provided by CL and rocksportrocker, I changed my table creation statement (simplified version) from:

CREATE TABLE table(
solution integer, num integer, ..., value real,
FOREIGN KEY (solution) REFERENCES solution (id),
FOREIGN KEY (num) REFERENCES nums (id)
);

to:

CREATE TABLE table(
solution integer, num integer, ..., value real,
PRIMARY KEY (solution, num),
FOREIGN KEY (solution) REFERENCES solution (id),
FOREIGN KEY (num) REFERENCES nums (id)
) WITHOUT ROWID;

In my test case,

  • File sizes remained the same
  • The .executemany INSERT statements increased from ~46 to ~69 seconds
  • The .execute SELECT statements decreased from ~129 to ~5 seconds
  • The .fetchone statements decreased from ~183 to ~0 seconds
  • Total time reduced from ~503 seconds to ~228 seconds, 45% of the original time

Any other improvements are still welcomed, hopefully this can become a good reference question for others who are new to SQL.

The execute() and fetchone() calls are where the database does all its work.

To speed up the query, the lookup columns must be indexed. To save space, you can use a clustered index, i.e., make the table a WITHOUT ROWID table.
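You can confirm the lookup uses the clustered index with EXPLAIN QUERY PLAN. A sketch, assuming a stand-in table name "results" (the exact detail string varies between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE results (
    solution INTEGER, num INTEGER, value REAL,
    PRIMARY KEY (solution, num)
) WITHOUT ROWID""")

# The last column of the plan row describes the access strategy,
# e.g. "SEARCH ... USING PRIMARY KEY (solution=? AND num=?)".
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM results WHERE solution = ? AND num = ?",
    (1, 2)).fetchone()
print(plan[3])
```

A plan that says SEARCH with the primary key, rather than SCAN, means each lookup is a B-tree descent instead of a full table scan.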

Did you consider introducing an index on the solution column? It would increase insertion time and the size of the .db file.
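A sketch of this alternative: keep the ordinary rowid table and add a secondary index covering the lookup columns instead of going WITHOUT ROWID (again, "results" and the index name are stand-ins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (solution INTEGER, num INTEGER, value REAL)")
# A secondary index on the lookup columns; stored separately from the table,
# which is why it grows the .db file and slows inserts.
conn.execute("CREATE INDEX idx_results_lookup ON results (solution, num)")

# The index is registered in sqlite_master once created.
name = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'").fetchone()[0]
print(name)  # idx_results_lookup
```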
