Poor bulk insert performance using Python 3 and SQLite
I have several text files containing URLs. I am trying to create an SQLite database to store these URLs in one table. The URL table has two columns: a primary key (INTEGER) and the URL (TEXT).
I try to insert 100,000 entries in one insert command and loop until the URL list is exhausted. Basically, I read all the text file contents into a list, then take smaller slices of 100,000 entries from it and insert each slice into the table.
The total number of URLs in the text files is 4,591,415, and the total text file size is approximately 97.5 MB.
Problems:
When I use a file database, the insert takes about 7 to 7.5 minutes. That does not feel like a very fast insert, given that I have a solid-state drive with fast read/write speeds. Besides that, I have about 10 GB of RAM available, as shown in Task Manager, and the processor is an i5-6300U at 2.4 GHz.
The text files total about 97.5 MB, but after I insert the URLs into SQLite, the database is about 350 MB, i.e. almost 3.5 times the size of the original data. Since the database contains no other tables, indexes, etc., this database size seems odd.
For problem 1, I tried tuning parameters and picked the best ones based on test runs with different values.
<table style="width:100%">
  <tr><th>Configuration</th><th>Time</th></tr>
  <tr><td>50,000 - with journal = delete and no transaction</td><td>0:12:09.888404</td></tr>
  <tr><td>50,000 - with journal = delete and with transaction</td><td>0:22:43.613580</td></tr>
  <tr><td>50,000 - with journal = memory and transaction</td><td>0:09:01.140017</td></tr>
  <tr><td>50,000 - with journal = memory</td><td>0:07:38.820148</td></tr>
  <tr><td>50,000 - with journal = memory and synchronous=0</td><td>0:07:43.587135</td></tr>
  <tr><td>50,000 - with journal = memory and synchronous=1 and page_size=65535</td><td>0:07:19.778217</td></tr>
  <tr><td>50,000 - with journal = memory and synchronous=0 and page_size=65535</td><td>0:07:28.186541</td></tr>
  <tr><td>50,000 - with journal = delete and synchronous=1 and page_size=65535</td><td>0:07:06.539198</td></tr>
  <tr><td>50,000 - with journal = delete and synchronous=0 and page_size=65535</td><td>0:07:19.810333</td></tr>
  <tr><td>50,000 - with journal = wal and synchronous=0 and page_size=65535</td><td>0:08:22.856690</td></tr>
  <tr><td>50,000 - with journal = wal and synchronous=1 and page_size=65535</td><td>0:08:22.326936</td></tr>
  <tr><td>50,000 - with journal = delete and synchronous=1 and page_size=4096</td><td>0:07:35.365883</td></tr>
  <tr><td>50,000 - with journal = memory and synchronous=1 and page_size=4096</td><td>0:07:15.183948</td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535</td><td>0:07:13.402985</td></tr>
</table>
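The PRAGMA combinations in the table above can be set from Python before the bulk insert starts. A minimal sketch (the file name is illustrative, not the question's actual code); note that SQLite only accepts a page size that is a power of two between 512 and 65536, so the value 65535 used in the table would actually be ignored:

```python
import os
import sqlite3
import tempfile

# Hypothetical database path; the question's repository uses its own naming.
path = os.path.join(tempfile.mkdtemp(), 'urls.db')
db = sqlite3.connect(path)

# page_size only takes effect while the database file is still empty,
# and must be a power of two in 512..65536 (65535 would be silently ignored).
db.execute('PRAGMA page_size = 65536')
db.execute('PRAGMA journal_mode = MEMORY')
db.execute('PRAGMA synchronous = 1')    # 1 = NORMAL
db.execute('PRAGMA cache_size = 8192')  # in pages (a negative value means KiB)

db.execute('CREATE TABLE blacklist '
           '(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, '
           'url TEXT NOT NULL UNIQUE)')
print(db.execute('PRAGMA page_size').fetchone()[0])
```

Since `page_size` is fixed once the first table is created, it must come before the `CREATE TABLE`, while `journal_mode`, `synchronous`, and `cache_size` can be changed per connection at any time.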
I looked around online and found this link, https://adamyork.com/2017/07/02/fast-database-inserts-with-python-3-6-and-sqlite/, where a system much slower than mine still performs extremely well. Two things that stood out from that link were:
I have shared the Python code and the files here: https://github.com/ksinghgithub/python_sqlite
Can someone guide me on optimizing this code? Thanks.
Environment:
Edit 1: New performance chart based on the feedback received about the UNIQUE constraint and on experimenting with cache_size values.
self.db.execute('CREATE TABLE blacklist (id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL UNIQUE)')
<table>
  <tr><th>Configuration</th><th>Action</th><th>Time</th><th>Notes</th></tr>
  <tr><td>50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = 8192</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:18.011823</td><td>Size reduced to 196MB from 350MB</td></tr>
  <tr><td>50,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = default</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:25.692283</td><td>Size reduced to 196MB from 350MB</td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535</td><td></td><td>0:07:13.402985</td><td></td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = 4096</td><td></td><td>0:04:47.624909</td><td></td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = 8192</td><td></td><td>0:03:32.473927</td><td></td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = 8192</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:17.927050</td><td>Size reduced to 196MB from 350MB</td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = default</td><td>REMOVE UNIQUE FROM URL</td><td>0:00:21.804679</td><td>Size reduced to 196MB from 350MB</td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = default</td><td>REMOVE UNIQUE FROM URL &amp; ID</td><td>0:00:14.062386</td><td>Size reduced to 134MB from 350MB</td></tr>
  <tr><td>100,000 - with journal = delete and synchronous=1 and page_size=65535 and cache_size = default</td><td>REMOVE UNIQUE FROM URL &amp; DELETE ID</td><td>0:00:11.961004</td><td>Size reduced to 134MB from 350MB</td></tr>
</table>
SQLite uses autocommit mode by default, which allows begin transaction to be omitted. But here we want all the inserts in one transaction, and the only way to do that is to start one with begin transaction, so that all the statements that run next are inside that transaction.
The method executemany is just a loop over execute performed outside Python; it calls SQLite's prepare-statement function only once.
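Putting the two points together, a minimal sketch under Python's sqlite3 module (in-memory database and table layout are illustrative): with isolation_level=None the module stays in autocommit mode, so we open the transaction explicitly and let one executemany per batch reuse a single prepared statement:

```python
import sqlite3

# isolation_level=None keeps sqlite3 in autocommit mode, so the
# transaction boundaries below are entirely under our control.
db = sqlite3.connect(':memory:', isolation_level=None)
db.execute('CREATE TABLE blacklist (id INTEGER PRIMARY KEY, url TEXT NOT NULL)')

urls = ['http://example.com/page/{}'.format(i) for i in range(1000)]

db.execute('BEGIN')  # one transaction for the whole batch
db.executemany('INSERT INTO blacklist(url) VALUES (?)',
               ((u,) for u in urls))  # statement is prepared once
db.execute('COMMIT')

count = db.execute('SELECT COUNT(*) FROM blacklist').fetchone()[0]
print(count)  # 1000
```

Without the explicit BEGIN, each implicit per-statement transaction would pay the commit overhead separately.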
Here is a very bad way to remove the last N items from a list:
templist = []
i = 0
while i < self.bulk_insert_entries and len(urls) > 0:
    templist.append(urls.pop())
    i += 1
It is better to do this:
templist = urls[-self.bulk_insert_entries:]
del urls[-self.bulk_insert_entries:]
i = len(templist)
Slicing and deleting a slice work even on an empty list.
Both approaches may have the same complexity, but 100,000 calls to append and pop cost much more than letting Python do the work outside the interpreter.
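A quick check that the two approaches take the same batch off the same list (N is arbitrary here; the pop loop yields the items in reverse order, which does not matter for inserting them):

```python
N = 3
urls_a = list('abcdefg')
urls_b = list('abcdefg')

# pop-in-a-loop version (the "very bad" way above)
templist_a = []
i = 0
while i < N and len(urls_a) > 0:
    templist_a.append(urls_a.pop())
    i += 1

# slice + del-slice version
templist_b = urls_b[-N:]
del urls_b[-N:]

print(sorted(templist_a), urls_a)  # same elements, pop reversed their order
print(templist_b, urls_b)
```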
The UNIQUE constraint on the column "url" creates an implicit index on the URLs. That would explain the size increase.
I don't think you can populate the table first and then add the unique constraint.
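That implicit index is visible in sqlite_master; a minimal sketch reproducing the question's schema shows the automatic index SQLite creates (named sqlite_autoindex_&lt;table&gt;_&lt;n&gt;) to enforce the constraint:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE blacklist '
           '(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, '
           'url TEXT NOT NULL UNIQUE)')

# The UNIQUE constraint materializes as an automatic index; it is stored
# in the database file and grows with every inserted URL.
indexes = db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'").fetchall()
print(indexes)  # [('sqlite_autoindex_blacklist_1',)]
```

The INTEGER PRIMARY KEY costs nothing extra, since it is just an alias for the rowid; only the UNIQUE url column pays for a second B-tree.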
Your bottleneck is certainly the CPU. Try the following:
pip install toolz
and use this method:
from datetime import datetime
import sqlite3

from toolz import partition_all

def add_blacklist_url(self, urls):
    # print('add_blacklist_url:: entries = {}'.format(len(urls)))
    start_time = datetime.now()
    for batch in partition_all(100000, urls):
        try:
            start_commit = datetime.now()
            # each URL is wrapped in a 1-tuple for the positional placeholder
            self.cursor.executemany(
                'INSERT OR IGNORE INTO blacklist(url) VALUES (?)',
                ((url,) for url in batch))
            end_commit = datetime.now() - start_commit
            print('add_blacklist_url:: total time for INSERT OR IGNORE INTO blacklist {} entries = {}'.format(len(batch), end_commit))
        except sqlite3.Error as e:
            print("add_blacklist_url:: Database error: %s" % e)
        except Exception as e:
            print("add_blacklist_url:: Exception in _query: %s" % e)
    self.db.commit()
    time_elapsed = datetime.now() - start_time
    print('add_blacklist_url:: total time for {} entries = {}'.format(len(urls), time_elapsed))
The code is not tested.
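If pulling in toolz is undesirable, its partition_all can be approximated with itertools.islice from the standard library; a sketch (this helper is my own, not part of the answer above, and it yields lists where toolz yields tuples):

```python
from itertools import islice

def partition_all(n, iterable):
    """Yield successive chunks of at most n items each,
    similar to toolz.partition_all (which yields tuples)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

batches = list(partition_all(3, range(8)))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

Because it consumes the source iterator lazily, it never holds more than one chunk in memory, which matters when batching millions of URLs.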