
Poor bulk insert performance using Python 3 and SQLite

I have a few text files which contain URLs. I am trying to create a SQLite database to store these URLs in a table. The URL table has two columns, i.e. a primary key (INTEGER) and a URL (TEXT).

I try to insert 100,000 entries in one insert command and loop until I finish the URL list. Basically, I read the content of all the text files into a list, then create smaller lists of 100,000 entries each and insert them into the table.
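
Roughly, the batching described above looks like this (a simplified sketch, not the actual code from the repository linked below; the blacklist table name is taken from the schema in Edit 1):

    import sqlite3

    def insert_urls(db_path, urls, batch_size=100000):
        """Simplified sketch of the batching described above."""
        db = sqlite3.connect(db_path)
        db.execute('CREATE TABLE IF NOT EXISTS blacklist '
                   '(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')
        for start in range(0, len(urls), batch_size):
            chunk = urls[start:start + batch_size]
            # One executemany() call per 100,000-row chunk, then commit.
            db.executemany('INSERT OR IGNORE INTO blacklist(url) VALUES (?)',
                           ((url,) for url in chunk))
            db.commit()
        db.close()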

Total URLs in the text files are 4,591,415 and total text file size is approx 97.5 MB.

Problems :

  1. When I use a file-based database, it takes around 7-7.5 minutes to insert. I feel this is not a very fast insert, given that I have a solid-state drive with faster read/write. Along with that, I have approximately 10 GB of RAM available as seen in Task Manager. The processor is an i5-6300U at 2.4 GHz.

  2. The total size of the text files is approx. 97.5 MB. But after I insert the URLs into SQLite, the database is approximately 350 MB, i.e. almost 3.5 times the original data size. Since the database doesn't contain any other tables, indexes, etc., this database size looks a little odd.

For problem 1, I tried playing with parameters and came up with the best ones based on test runs with different settings.

| Configuration | Time |
| --- | --- |
| 50,000 - with journal = delete and no transaction | 0:12:09.888404 |
| 50,000 - with journal = delete and with transaction | 0:22:43.613580 |
| 50,000 - with journal = memory and transaction | 0:09:01.140017 |
| 50,000 - with journal = memory | 0:07:38.820148 |
| 50,000 - with journal = memory and synchronous=0 | 0:07:43.587135 |
| 50,000 - with journal = memory and synchronous=1 and page_size=65535 | 0:07:19.778217 |
| 50,000 - with journal = memory and synchronous=0 and page_size=65535 | 0:07:28.186541 |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:06.539198 |
| 50,000 - with journal = delete and synchronous=0 and page_size=65535 | 0:07:19.810333 |
| 50,000 - with journal = wal and synchronous=0 and page_size=65535 | 0:08:22.856690 |
| 50,000 - with journal = wal and synchronous=1 and page_size=65535 | 0:08:22.326936 |
| 50,000 - with journal = delete and synchronous=1 and page_size=4096 | 0:07:35.365883 |
| 50,000 - with journal = memory and synchronous=1 and page_size=4096 | 0:07:15.183948 |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | 0:07:13.402985 |
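
The journal, synchronous, page_size and cache_size settings above are SQLite PRAGMAs. A minimal sketch of how they can be applied from Python (the file name and values are placeholders, not necessarily the exact ones used in the runs above):

    import sqlite3

    db = sqlite3.connect('blacklist.db')
    # page_size must be a power of two and only takes effect before the first
    # table is written (or after VACUUM), so it is set first.
    db.execute('PRAGMA page_size = 65536')
    db.execute('PRAGMA journal_mode = MEMORY')  # or DELETE / WAL
    db.execute('PRAGMA synchronous = 1')        # 0 = OFF, 1 = NORMAL, 2 = FULL
    db.execute('PRAGMA cache_size = 8192')      # positive values are counted in pages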

I was checking online and saw this link https://adamyork.com/2017/07/02/fast-database-inserts-with-python-3-6-and-sqlite/ where the system is much slower than mine but still performs very well. Two things that stood out from this link were:

  1. The table in the link had more columns than what I have.
  2. The database file didn't grow 3.5 times.

I have shared the python code and the files here: https://github.com/ksinghgithub/python_sqlite

Can someone guide me on optimizing this code? Thanks.

Environment:

  1. Windows 10 Professional on an i5-6300U with 20 GB RAM and a 512 GB SSD.
  2. Python 3.7.0

Edit 1: New performance chart based on the feedback received about the UNIQUE constraint and on my experimenting with the cache_size value.

    self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')

| Configuration | Action | Time | Notes |
| --- | --- | --- | --- |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | REMOVE UNIQUE FROM URL | 0:00:18.011823 | Size reduced to 196 MB from 350 MB |
| 50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL | 0:00:25.692283 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 | | 0:07:13.402985 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=4096 | | 0:04:47.624909 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | | 0:03:32.473927 | |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=8192 | REMOVE UNIQUE FROM URL | 0:00:17.927050 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL | 0:00:21.804679 | Size reduced to 196 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL & ID | 0:00:14.062386 | Size reduced to 134 MB from 350 MB |
| 100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size=default | REMOVE UNIQUE FROM URL & DELETE ID | 0:00:11.961004 | Size reduced to 134 MB from 350 MB |
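
For clarity, the schema variants referred to in the Action column look roughly like this (my reading of the actions, written in the same style as the CREATE TABLE above; not the exact statements from the repository):

    # "REMOVE UNIQUE FROM URL": keep the id column, drop only the UNIQUE constraint.
    self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)')

    # "REMOVE UNIQUE FROM URL & DELETE ID": a single TEXT column with no constraints,
    # so only the implicit rowid and the URL itself are stored.
    self.db.execute('CREATE TABLE blacklist (url TEXT NOT NULL)')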

SQLite uses auto-commit mode by default. This permits BEGIN TRANSACTION to be omitted. But here we want all the inserts to be in one transaction, and the only way to do that is to start it with BEGIN TRANSACTION so that all the statements that are going to be run are part of that transaction.
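
With Python's sqlite3 module, that means opening the connection with isolation_level=None (so the module does not issue its own implicit BEGIN statements) and sending the transaction commands yourself. A minimal sketch, assuming the blacklist table from the question already exists:

    import sqlite3

    db = sqlite3.connect('blacklist.db', isolation_level=None)  # plain auto-commit; we manage transactions
    cur = db.cursor()

    cur.execute('BEGIN TRANSACTION')
    cur.executemany('INSERT OR IGNORE INTO blacklist(url) VALUES (?)',
                    [('http://example.com/a',), ('http://example.com/b',)])
    cur.execute('COMMIT')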

The method executemany is only a loop over execute, done outside Python, that calls SQLite's prepare-statement function only once.
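
As a sketch of what that means in practice (not the module's actual implementation), continuing with cur from above:

    rows = [('http://example.com/1',), ('http://example.com/2',)]

    # Prepared once, looped over in C:
    cur.executemany('INSERT OR IGNORE INTO blacklist(url) VALUES (?)', rows)

    # Does the same inserts, but with Python-level overhead on every row:
    for row in rows:
        cur.execute('INSERT OR IGNORE INTO blacklist(url) VALUES (?)', row)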

The following is a really bad way to remove the last N items from a list:

    templist = []
    i = 0
    while i < self.bulk_insert_entries and len(urls) > 0:
        templist.append(urls.pop())
        i += 1

It is better to do this:

    templist = urls[-self.bulk_insert_entries:]   # copy the last N items in one slice
    del urls[-self.bulk_insert_entries:]          # then remove them from the source list
    i = len(templist)

The slice and del slice work even on an empty list.

Both might have the same complexity, but 100K calls to append and pop cost a lot more than letting Python do it outside the interpreter.
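
A quick way to see the difference is to time both variants on the same workload (a rough sketch; absolute numbers will vary by machine):

    import time

    N = 100000

    def pop_loop(urls):
        templist = []
        i = 0
        while i < N and len(urls) > 0:
            templist.append(urls.pop())
            i += 1
        return templist

    def slice_del(urls):
        templist = urls[-N:]
        del urls[-N:]
        return templist

    for fn in (pop_loop, slice_del):
        urls = list(range(1000000))  # fresh list for each variant
        t0 = time.perf_counter()
        fn(urls)
        t1 = time.perf_counter()
        print(fn.__name__, t1 - t0)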

The UNIQUE constraint on column "url" is creating an implicit index on the URL. That would explain the size increase.
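
You can see that implicit index by querying sqlite_master (a quick check, with the database file name as a placeholder):

    import sqlite3

    db = sqlite3.connect('blacklist.db')
    # UNIQUE constraints appear as automatically created indexes named
    # sqlite_autoindex_<table>_<n>, and they take space like any explicit index.
    for name, tbl_name in db.execute("SELECT name, tbl_name FROM sqlite_master WHERE type = 'index'"):
        print(name, 'on', tbl_name)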

I don't think you can populate the table and afterwards add the unique constraint.

Your bottleneck is surely the CPU. Try the following:

  1. Install toolz: pip install toolz
  2. Use this method:

     from datetime import datetime
     import sqlite3

     from toolz import partition_all

     def add_blacklist_url(self, urls):
         # print('add_blacklist_url:: entries = {}'.format(len(urls)))
         start_time = datetime.now()
         for batch in partition_all(100000, urls):
             try:
                 start_commit = datetime.now()
                 # Assumes urls is a list of plain URL strings; each one is wrapped
                 # in a dict to match the named :url placeholder.
                 self.cursor.executemany('''INSERT OR IGNORE INTO blacklist(url) VALUES(:url)''',
                                         ({'url': url} for url in batch))
                 end_commit = datetime.now() - start_commit
                 print('add_blacklist_url:: total time for INSERT OR IGNORE INTO blacklist {} entries = {}'.format(len(batch), end_commit))
             except sqlite3.Error as e:
                 print("add_blacklist_url:: Database error: %s" % e)
             except Exception as e:
                 print("add_blacklist_url:: Exception in _query: %s" % e)
         self.db.commit()
         time_elapsed = datetime.now() - start_time
         print('add_blacklist_url:: total time for {} entries = {}'.format(len(urls), time_elapsed))

The code was not tested.
