简体   繁体   English

Python:sqlite3-如何加快数据库更新

[英]Python: sqlite3 - how to speed up updating of the database

I have a database, which I store as a .db file on my disk. 我有一个数据库,该数据库以.db文件的形式存储在磁盘上。 I implemented all the function neccessary for managing this database using sqlite3 . 我使用sqlite3实现了管理该数据库所需的所有功能。 However, I noticed that updating the rows in the table takes a large amount of time. 但是,我注意到更新表中的行需要大量时间。 My database has currently 608042 rows. 我的数据库当前有608042行。 The database has one table - let's call it Table1 . 该数据库有一个表-我们称之为Table1 This table consists of the following columns: 该表包括以下列:

id | name | age | address | job | phone | income

( id value is generated automaticaly while a row is inserted to the database). (将一行插入数据库时​​自动生成id值)。 After reading-in all the rows I perform some operations (ML algorithms for predicting the income) on the values from the rows, and next I have to update (for each row) the value of income (thus, for each one from 608042 rows I perform the SQL update operation). 读完所有行后,我对行中的值执行一些操作(用于预测收入的ML算法),然后我必须(针对每一行)更新income值(因此,针对608042行中的每一行)我执行SQL update操作)。 In order to update, I'm using the following function (copied from my class): 为了更新,我使用以下功能(从我的课程中复制):

def update_row(self, new_value, idkey):
    update_query = "UPDATE Table1 SET income = ? WHERE name = ?" % 
    self.cursor.execute(update_query, (new_value, idkey))
    self.db.commit()

And I call this function for each person registered in the database. 我为在数据库中注册的每个人调用此函数。

for each i out of 608042 rows:
  update_row(new_income_i, i.name)

(values of new_income_i are different for each i). (每个i的new_income_i的值都不同)。 This takes a huge amount of time, even though the dataset is not giant. 即使数据集不是很大,也要花费大量时间。 Is there any way to speed up the updating of the database? 有什么办法可以加快数据库的更新? Should I use something else than sqlite3 ? 我是否应该使用sqlite3以外的其他工具? Or should I instead of storing the database as a .db file store it in memory (using sqlite3.connect(":memory:") )? 还是应该代替将数据库存储为.db文件而是将其存储在内存中(使用sqlite3.connect(":memory:") )?

Each UPDATE statement must scan the entire table to find any row(s) that match the name. 每个UPDATE语句必须扫描整个表以查找与该名称匹配的任何行。

An index on the name column would prevent this and make the search much faster. name列上的索引可以防止这种情况并使搜索更快。 (See Query Planning and How does database indexing work? ) (请参阅查询计划数据库索引如何工作?

However, if the name column is not unique, then that value is not even suitable to find individual rows: each update with a duplicate name would modify all rows with the same name. 但是,如果name列不是唯一的,则该值甚至都不适合查找单独的行:每次使用重复名称进行的更新都会修改具有相同名称的所有行。 So you should use the id column to identify the row to be updated; 因此,您应该使用id列来标识要更新的行; and as the primary key, this column already has an implicit index. 作为主键,此列已具有隐式索引。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM