
Reducing time of sqlite3 execute/fetchone in python

Context

I'm working with several files in a proprietary format that store the results of a power system solution. The data is formatted fairly simply, but each result file is ~50MB. An API is provided for querying the file format, but I need to run lots of queries, and the API is horrendously slow.

I wrote a program to compare several of these files to each other using the API, and left it running for a couple of hours to no avail. My next thought was to do a single pass over each file, store the data I need in a sqlite3 database, and then query that. That got me a result in 20 minutes. Much better. Restructuring the data to avoid JOINs where possible: 12 minutes. Storing the .db file in a temporary local location instead of on the network: 8.5 minutes.

Further Improvement

The program is more or less tolerable at its current speed, but it will be run many, many times per day once it's completed. At the moment, 62% of the run time is spent on 721 calls of .execute/.fetchone:

      160787763 function calls (160787745 primitive calls) in 503.061 seconds
Ordered by: internal time
List reduced from 1507 to 20 due to restriction <20>
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   721  182.869    0.254  182.869    0.254 {method 'fetchone' of 'sqlite3.Cursor' objects}
   721  129.355    0.179  129.355    0.179 {method 'execute' of 'sqlite3.Cursor' objects}
 24822   45.734    0.002   47.600    0.002 {method 'executemany' of 'sqlite3.Connection' objects}
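Profiles like the one above can be produced with the standard cProfile and pstats modules. A minimal sketch (the compare() function here is a hypothetical stand-in for the real comparison routine):

```python
import cProfile
import pstats

def compare():
    # Stand-in for the real file-comparison routine.
    return sum(i * i for i in range(1000))

# Profile the call explicitly rather than via cProfile.run(),
# so only the code between enable() and disable() is measured.
profiler = cProfile.Profile()
profiler.enable()
compare()
profiler.disable()

# Print the 20 most expensive functions, sorted by internal
# time ("tottime"), matching the listing above.
stats = pstats.Stats(profiler)
stats.sort_stats("tottime").print_stats(20)
```

The `restriction <20>` line in the output above corresponds to the `print_stats(20)` argument.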

Since so much time is spent in this small section, I thought I would ask for ideas to improve it before moving forward. I feel like I may be missing something simple that a more experienced eye will catch. This particular part of the program is basically structured like this:

for i, db in enumerate(dbs):
    for (solution_id, num), vals in val_map.items():
        # If it already has a value, no need to get a comparison value.
        # Test against None explicitly so a legitimate 0.0 isn't refetched.
        if vals[i] is None:
            # Only get a comparison value if the solution is valid for the current db
            if solution_id in db.valid_ids:
                db.cur.execute("""SELECT value FROM table WHERE solution = ? AND num = ?""",
                               (solution_id, num))
                row = db.cur.fetchone()
                # .fetchone() returns None when there is no matching row
                if row is not None:
                    vals[i] = row[0]

The val_map structure is:

val_map = {(solution_id, num): [db1_val, db2_val, db3_val, db4_val]}

Every entry has at least one db_val; the others are None. The purpose of the loop above is to fill every db_val slot that can be filled, so the values can be compared.

The Question

I've read that sqlite3 SELECT statements can only be executed with .execute, which rules out .executemany (which saved me tons of time on INSERTs). I've also read in the Python docs that calling .execute directly on the connection object can be more efficient, but I can't do that since I need to fetch the data.

Is there a better way to structure the loop, or the query, to minimize the amount of time spent on .execute and .fetchone calls?
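One alternative worth sketching (not the approach ultimately taken below): since each database is hit hundreds of times for single rows, the per-call overhead can be avoided entirely by pulling every needed row in one SELECT and doing the lookups in a Python dict. This is a minimal sketch, assuming the simplified schema; the table is named results here because table itself is a reserved word in SQL:

```python
import sqlite3

# Build a small in-memory database matching the simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (solution INTEGER, num INTEGER, value REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)",
                 [(1, 10, 0.5), (1, 11, 0.7), (2, 10, 0.9)])

# One pass over the table replaces hundreds of single-row SELECTs;
# the (solution, num) pair becomes the dict key, as in val_map.
lookup = {(sol, num): value
          for sol, num, value in conn.execute(
              "SELECT solution, num, value FROM results")}

# Lookups are now plain dict accesses with no cursor round trips.
print(lookup.get((1, 10)))  # 0.5
print(lookup.get((3, 10)))  # None (no matching row)
```

Whether this wins depends on how much of the table the loop actually touches; it trades memory for per-query overhead.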

The Answer

Based on the answers provided by CL and rocksportrocker, I changed my table create statement (simplified version) from:

CREATE TABLE table(
solution integer, num integer, ..., value real,
FOREIGN KEY (solution) REFERENCES solution (id),
FOREIGN KEY (num) REFERENCES nums (id)
);

to:

CREATE TABLE table(
solution integer, num integer, ..., value real,
PRIMARY KEY (solution, num),
FOREIGN KEY (solution) REFERENCES solution (id),
FOREIGN KEY (num) REFERENCES nums (id)
) WITHOUT ROWID;

In my test case:

  • File sizes remained the same
  • The .executemany INSERT statements increased from ~46 to ~69 seconds
  • The .execute SELECT statements decreased from ~129 to ~5 seconds
  • The .fetchone statements decreased from ~183 to ~0 seconds
  • Total time was reduced from ~503 to ~228 seconds, 45% of the original time

Any other improvements are still welcome; hopefully this can become a good reference question for others who are new to SQL.

The execute() and fetchone() calls are where the database does all its work.

To speed up the query, the lookup columns must be indexed. To save space, you can use a clustered index, i.e., make the table a WITHOUT ROWID table.

Did you consider introducing an index on the solution column? It would increase insertion time and the size of the .db file.
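A minimal sketch of that suggestion, assuming the simplified schema (the table is named results here because table is a reserved word in SQL, and the index name is illustrative). A composite index on both lookup columns lets SQLite satisfy the WHERE clause without a full table scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (solution INTEGER, num INTEGER, value REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?, ?)",
                 [(1, 10, 0.5), (2, 10, 0.9)])

# Index both columns used in the WHERE clause of the hot query.
conn.execute("CREATE INDEX idx_solution_num ON results (solution, num)")

# EXPLAIN QUERY PLAN shows whether SQLite actually uses the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM results "
    "WHERE solution = ? AND num = ?", (1, 10)).fetchone()
print(plan[-1])  # e.g. a SEARCH step USING INDEX idx_solution_num

row = conn.execute("SELECT value FROM results WHERE solution = ? AND num = ?",
                   (1, 10)).fetchone()
print(row[0])  # 0.5
```

The WITHOUT ROWID answer accepted above achieves the same effect by making (solution, num) the table's clustered primary key, which is why it also avoided growing the file.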
