
Poor bulk insert performance using Python 3 and SQLite

I have a few text files which contain URLs. I am trying to create an SQLite database to store these URLs in a table. The URL table has two columns, i.e. a primary key (INTEGER) and a URL (TEXT).

I try to insert 100,000 entries with one insert command and loop until I finish the URL list. Basically, I read the content of all the text files and save it in a list, then I build smaller lists of 100,000 entries each and insert them into the table.

The text files contain 4,591,415 URLs in total, and the combined text file size is approximately 97.5 MB.

Problems:

  1. When I use a file database, it takes around 7-7.5 minutes to insert. I feel this is not a very fast insert given that I have a solid-state drive with fast read/write speeds. Along with that I have approximately 10 GB of RAM available as seen in Task Manager. The processor is an i5-6300U 2.4 GHz.

  2. The total size of the text files is approx. 97.5 MB, but after I insert the URLs into SQLite, the SQLite database is approximately 350 MB, i.e. almost 3.5 times the original data size. Since the database doesn't contain any other tables, indexes etc., this database size looks a little odd.

For problem 1, I tried playing with the parameters and picked the best ones based on test runs with different combinations.

| Configuration | Time |
| --- | --- |
| 50,000 - with journal = delete and no transaction | 0:12:09.888404 |
| 50,000 - with journal = delete and with transaction | 0:22:43.613580 |
| 50,000 - with journal = memory and transaction | 0:09:01.140017 |
| 50,000 - with journal = memory | 0:07:38.820148 |
| 50,000 - with journal = memory and synchronous=0 | 0:07:43.587135 |
| 50,000 - with journal = memory and synchronous=1 and page_size=65535 | 0:07:19.778217 |
| 50,000 - with journal = memory and synchronous=0 and page_size=65535 | 0:07:28.186541 |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:06.539198 |
| 50,000 - with journal = delete and synchronous=0 and page_size=65535 | 0:07:19.810333 |
| 50,000 - with journal = wal and synchronous=0 and page_size=65535 | 0:08:22.856690 |
| 50,000 - with journal = wal and synchronous=1 and page_size=65535 | 0:08:22.326936 |
| 50,000 - with journal = delete and synchronous=1 and page_size=4096 | 0:07:35.365883 |
| 50,000 - with journal = memory and synchronous=1 and page_size=4096 | 0:07:15.183948 |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:13.402985 |
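For reference, the journal, synchronous and page_size settings in the table are applied with PRAGMA statements right after opening the connection. A minimal sketch, assuming a file database named blacklist.db (the file name is illustrative, not taken from the shared repository):

    import sqlite3

    db = sqlite3.connect('blacklist.db')  # illustrative file name

    # The settings from the table above, applied as PRAGMAs.
    db.execute('PRAGMA journal_mode = MEMORY')  # or DELETE / WAL
    db.execute('PRAGMA synchronous = 1')        # 0 = OFF, 1 = NORMAL, 2 = FULL
    # page_size must be a power of two (512..65536) and only takes effect for a
    # new database or after VACUUM, so a value like 65535 is silently ignored.
    db.execute('PRAGMA page_size = 65536')
    db.execute('CREATE TABLE IF NOT EXISTS blacklist '
               '(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')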

I was checking online and saw this link https://adamyork.com/2017/07/02/fast-database-inserts-with-python-3-6-and-sqlite/ where the system is much slower than mine but is still performing very well. Two things stood out from this link:

  1. The table in the link had more columns than what I have.
  2. The database file didn't grow 3.5 times.

I have shared the Python code and the files here: https://github.com/ksinghgithub/python_sqlite

Can someone guide me on optimizing this code? Thanks.

Environment:

  1. Windows 10 Professional on an i5-6300U with 20 GB RAM and a 512 GB SSD.
  2. Python 3.7.0

Edit 1: New performance chart based on the feedback received about the UNIQUE constraint and on my experiments with the cache_size value.

self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')

| Configuration | Action | Time | Notes |
| --- | --- | --- | --- |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | REMOVE UNIQUE FROM URL | 0:00:18.011823 | Size reduced to 196 MB from 350 MB |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL | 0:00:25.692283 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | | 0:07:13.402985 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=4096 | | 0:04:47.624909 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | | 0:03:32.473927 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | REMOVE UNIQUE FROM URL | 0:00:17.927050 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL | 0:00:21.804679 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL & ID | 0:00:14.062386 | Size reduced to 134 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL & DELETE ID | 0:00:11.961004 | Size reduced to 134 MB from 350 MB |
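For reference, the two changes behind the fastest rows (setting cache_size and dropping the UNIQUE constraint on url) look roughly like this; a minimal sketch, not the exact code from the shared repository:

    import sqlite3

    db = sqlite3.connect('blacklist.db')  # illustrative file name

    # cache_size is given in pages when positive, or in KiB when negative.
    db.execute('PRAGMA cache_size = 8192')

    # Same table as before, but without UNIQUE on url, so no implicit index
    # is maintained while the rows are being inserted.
    db.execute('CREATE TABLE IF NOT EXISTS blacklist '
               '(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)')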

SQLite uses auto-commit mode by default. This permits BEGIN TRANSACTION to be omitted. But here we want all the inserts to be in a single transaction, and the only way to do that is to start one with BEGIN TRANSACTION so that all the statements that are going to be run are part of that transaction.

The method executemany is only a loop over execute done outside Python, and it calls SQLite's prepare-statement function only once.
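Putting those two points together, a minimal sketch of one explicit transaction wrapped around a single executemany call (the connection handling and the sample data are illustrative, and the blacklist table is assumed to exist already):

    import sqlite3

    # isolation_level=None puts the connection in true auto-commit mode, so the
    # explicit BEGIN below is the only thing that opens a transaction.
    db = sqlite3.connect('blacklist.db', isolation_level=None)

    urls = [('http://example.com/a',), ('http://example.com/b',)]  # illustrative data

    db.execute('BEGIN TRANSACTION')
    # qmark placeholders are used here to match the tuples above.
    db.executemany('INSERT OR IGNORE INTO blacklist(url) VALUES (?)', urls)
    db.execute('COMMIT')  # all inserts become durable as one unit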

The following is a really bad way to remove the last N items from a list:

    templist = []
    i = 0
    while i < self.bulk_insert_entries and len(urls) > 0:
        templist.append(urls.pop())
        i += 1

It is better to do this:

    templist = urls[-self.bulk_insert_entries:]
    del urls[-self.bulk_insert_entries:]
    i = len(templist)

The slice and del slice work even on an empty list.

Both might have the same complexity, but 100K calls to append and pop cost a lot more than letting Python do it outside the interpreter.

The UNIQUE constraint on column "url" creates an implicit index on that column. That would explain the size increase.

You cannot add the UNIQUE constraint itself after populating the table, because ALTER TABLE in SQLite cannot add constraints; the closest alternative is to create a separate unique index once the bulk load has finished.
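A minimal sketch of that approach (the index name is illustrative; the statement raises an IntegrityError if the loaded rows already contain duplicate URLs):

    import sqlite3

    db = sqlite3.connect('blacklist.db')

    # Build the index once, after the rows are in, instead of maintaining it
    # during every insert.
    db.execute('CREATE UNIQUE INDEX IF NOT EXISTS idx_blacklist_url ON blacklist(url)')
    db.commit()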

Your bottleneck is surely the CPU. Try the following:

  1. Install toolz: pip install toolz
  2. Use this method:

    import sqlite3
    from datetime import datetime

    from toolz import partition_all

    def add_blacklist_url(self, urls):
        # print('add_blacklist_url:: entries = {}'.format(len(urls)))
        start_time = datetime.now()
        # Each batch is a tuple of up to 100,000 items; with the named
        # placeholder :url below, each item must be a mapping like {'url': ...}.
        for batch in partition_all(100000, urls):
            try:
                start_commit = datetime.now()
                self.cursor.executemany('''INSERT OR IGNORE INTO blacklist(url) VALUES(:url)''', batch)
                end_commit = datetime.now() - start_commit
                print('add_blacklist_url:: total time for INSERT OR IGNORE INTO blacklist {} entries = {}'.format(len(batch), end_commit))
            except sqlite3.Error as e:
                print("add_blacklist_url:: Database error: %s" % e)
            except Exception as e:
                print("add_blacklist_url:: Exception in _query: %s" % e)
        # Commit once, after all batches, so the inserts share one transaction.
        self.db.commit()
        time_elapsed = datetime.now() - start_time
        print('add_blacklist_url:: total time for {} entries = {}'.format(len(urls), time_elapsed))

The code was not tested.
