sqlalchemy bulk update performance problems

I need to periodically increment values in a column with data I receive in a file. The table has more than 400,000 rows. So far, all my attempts have resulted in very poor performance. I have written an experiment that reflects my requirements:

from datetime import datetime
from random import randint

from sqlalchemy import (MetaData, Table, Column, Integer, Sequence,
                        bindparam, create_engine)

# create table
engine = create_engine('sqlite:///bulk_update.db', echo=False)
metadata = MetaData()

sometable = Table('sometable', metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer),
    Column('column2', Integer),
)

sometable.create(engine, checkfirst=True)

# initial population
conn = engine.connect()
nr_of_rows = 50000
insert_data = [{'column1': i, 'column2': 0} for i in range(1, nr_of_rows)]
result = conn.execute(sometable.insert(), insert_data)

# update: add a random increment to column2, matching rows on column1
update_data = [{'col1': i, '_increment': randint(1, 500)} for i in range(1, nr_of_rows)]

print("nr_of_rows", nr_of_rows)
print("start time : " + str(datetime.time(datetime.now())))

stmt = sometable.update().\
        where(sometable.c.column1 == bindparam('col1')).\
        values({sometable.c.column2: sometable.c.column2 + bindparam('_increment')})

# one statement executed with a list of parameter sets (executemany)
conn.execute(stmt, update_data)

print("end time : " + str(datetime.time(datetime.now())))

These are the times I get:

nr_of_rows 10000
start time  : 10:29:01.753938
end time    : 10:29:16.247651

nr_of_rows 50000
start time  : 10:30:35.236852
end time    : 10:36:39.070423

So doing 400,000+ rows will take much too long.

I am new to SQLAlchemy, but I have done a lot of reading in the docs, and I just can't see what I am doing wrong.

Thanks in advance!

You are using the correct approach for the bulk update: a single UPDATE statement executed with a list of parameter sets (executemany).

The reason it takes that long is that the table has no index on sometable.column1; it only has the primary-key index on the id column.

Your update query uses sometable.column1 in the WHERE clause to identify each record, so the database has to scan the whole table for every single row it updates.
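
You can see this with SQLite's EXPLAIN QUERY PLAN. Here is a minimal sketch against the schema above (not from the original post; the exact plan wording varies by SQLite version):

from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///bulk_update.db', echo=False)
with engine.connect() as conn:
    # Ask SQLite how it would execute one of the per-row updates.
    plan = conn.execute(text(
        "EXPLAIN QUERY PLAN "
        "UPDATE sometable SET column2 = column2 + 1 WHERE column1 = 42"))
    for row in plan:
        print(row)
    # Without an index on column1 the plan is a full scan
    # (e.g. "SCAN sometable", or "SCAN TABLE sometable" on older SQLite);
    # with the index it becomes a "SEARCH ... USING INDEX" lookup.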

To make the update run much faster, you need to change your table schema definition so that an index is created on column1, by adding index=True:

sometable = Table('sometable',  metadata,
    Column('id', Integer, Sequence('sometable_id_seq'), primary_key=True),
    Column('column1', Integer, index=True),
    Column('column2', Integer),
)
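
If the table has already been created, you don't have to drop and recreate it. Here is a minimal sketch, assuming the sometable definition above, that adds the same index to the existing table (ix_sometable_column1 is the name SQLAlchemy generates by default for index=True):

from sqlalchemy import Index

# Define the index against the existing column and emit CREATE INDEX.
Index('ix_sometable_column1', sometable.c.column1).create(engine)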

I tested the updated code on my machine: the program took less than 2 seconds to run.

BTW, kudos for your question description: you included all the code needed to reproduce the problem.
