
Bulk insert with SQLAlchemy ORM

Is there any way to get SQLAlchemy to do a bulk insert rather than inserting each individual object? That is, doing:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

rather than:

INSERT INTO `foo` (`bar`) VALUES (1)
INSERT INTO `foo` (`bar`) VALUES (2)
INSERT INTO `foo` (`bar`) VALUES (3)

I've just converted some code to use SQLAlchemy rather than raw SQL, and although it is now much nicer to work with, it seems to be slower (by up to a factor of 10). I'm wondering if this is the reason.

Maybe I could improve the situation by using sessions more efficiently. At the moment I have autoCommit=False and do a session.commit() after I've added some stuff. Although this seems to cause the data to go stale if the DB is changed elsewhere: even if I do a new query, I still get old results back?

Thanks for your help!

SQLAlchemy introduced this in version 1.0.0:

Bulk operations - SQLAlchemy docs

With these operations, you can now do bulk inserts or updates!

For instance, you can do:

s = Session()
objects = [
    User(name="u1"),
    User(name="u2"),
    User(name="u3")
]
s.bulk_save_objects(objects)
s.commit()

Here, a bulk insert will be made.

The SQLAlchemy docs have a writeup on the performance of various techniques that can be used for bulk inserts:

ORMs are basically not intended for high-performance bulk inserts - this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.

For the use case of fast bulk inserts, the SQL generation and execution system that the ORM builds on top of is part of the Core. Using this system directly, we can produce an INSERT that is competitive with using the raw database API directly.

Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of methods, which provide hooks into subsections of the unit of work process in order to emit Core-level INSERT and UPDATE constructs with a small degree of ORM-based automation.

The example below illustrates time-based tests for several different methods of inserting rows, going from the most automated to the least. With cPython 2.7, runtimes observed:

classics-MacBook-Pro:sqlalchemy classic$ python test.py
SQLAlchemy ORM: Total time for 100000 records 12.0471920967 secs
SQLAlchemy ORM pk given: Total time for 100000 records 7.06283402443 secs
SQLAlchemy ORM bulk_save_objects(): Total time for 100000 records 0.856323003769 secs
SQLAlchemy Core: Total time for 100000 records 0.485800027847 secs
sqlite3: Total time for 100000 records 0.487842082977 sec

Script:

import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())
engine = None


class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))


def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)


def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM pk given: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_bulk_insert(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    n1 = n
    while n1 > 0:
        n1 = n1 - 10000
        DBSession.bulk_insert_mappings(
            Customer,
            [
                dict(name="NAME " + str(i))
                for i in xrange(min(10000, n1))
            ]
        )
    DBSession.commit()
    print(
        "SQLAlchemy ORM bulk_save_objects(): Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )
    print(
        "SQLAlchemy Core: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute(
        "CREATE TABLE customer (id INTEGER NOT NULL, "
        "name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn


def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in xrange(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print(
        "sqlite3: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " sec")


if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_orm_bulk_insert(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

As far as I know, there is no way to get the ORM to issue bulk inserts. I believe the underlying reason is that SQLAlchemy needs to keep track of each object's identity (i.e., new primary keys), and bulk inserts interfere with that. For example, assuming your foo table contains an id column and is mapped to a Foo class:

x = Foo(bar=1)
print x.id
# None
session.add(x)
session.flush()
# BEGIN
# INSERT INTO foo (bar) VALUES(1)
# COMMIT
print x.id
# 1

Since SQLAlchemy picked up the value for x.id without issuing another query, we can infer that it got the value directly from the INSERT statement. If you don't need subsequent access to the created objects via the same instances, you can skip the ORM layer for your insert:

Foo.__table__.insert().execute([{'bar': 1}, {'bar': 2}, {'bar': 3}])
# INSERT INTO foo (bar) VALUES ((1,), (2,), (3,))

SQLAlchemy can't match these new rows with any existing objects, so you'll have to query them anew for any subsequent operations.
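For instance (a minimal sketch, reusing the session and the hypothetical Foo mapping from above), you could fetch ORM instances for the freshly inserted rows like this:

# Re-query to get normal, session-tracked Foo instances with their ids populated
new_foos = session.query(Foo).filter(Foo.bar.in_([1, 2, 3])).all()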

As far as stale data is concerned, it's helpful to remember that the session has no built-in way to know when the database is changed outside of the session. In order to access externally modified data through existing instances, the instances must be marked as expired. This happens by default on session.commit(), but can be done manually by calling session.expire_all() or session.expire(instance). An example (SQL omitted):

x = Foo(bar=1)
session.add(x)
session.commit()
print x.bar
# 1
foo.update().execute(bar=42)
print x.bar
# 1
session.expire(x)
print x.bar
# 42

session.commit() expires x, so the first print statement implicitly opens a new transaction and re-queries x's attributes. If you comment out the first print statement, you'll notice that the second one now picks up the correct value, because the new query isn't emitted until after the update.

This makes sense from the point of view of transactional isolation - you should only pick up external modifications between transactions. If this is causing you trouble, I'd suggest clarifying or re-thinking your application's transaction boundaries instead of immediately reaching for session.expire_all().

Direct support was added to SQLAlchemy as of version 0.8.

As per the docs, connection.execute(table.insert().values(data)) should do the trick. (Note that this is not the same as connection.execute(table.insert(), data), which results in many individual row inserts via a call to executemany.) On anything but a local connection, the difference in performance can be enormous.
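A minimal sketch of the two forms, assuming the 1.x-style API this answer refers to; the in-memory engine URL and the foo table definition are illustrative, not from the original answer:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer

engine = create_engine('sqlite://')  # illustrative in-memory database
metadata = MetaData()
foo = Table('foo', metadata, Column('bar', Integer))
metadata.create_all(engine)

data = [{'bar': 1}, {'bar': 2}, {'bar': 3}]
with engine.connect() as connection:
    # Single multi-row statement: INSERT INTO foo (bar) VALUES (1), (2), (3)
    connection.execute(foo.insert().values(data))
    # By contrast, this form goes through executemany(), one parameter set per row
    connection.execute(foo.insert(), data)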

I usually do it using add_all.

from app import session
from models import User

objects = [User(name="u1"), User(name="u2"), User(name="u3")]
session.add_all(objects)
session.commit()

SQLAlchemy introduced this in version 1.0.0:

Bulk operations - SQLAlchemy docs

With these operations, you can now do bulk inserts or updates!

For instance (if you want the lowest overhead for simple table INSERTs), you can use Session.bulk_insert_mappings():

loadme = [(1, 'a'),
          (2, 'b'),
          (3, 'c')]
dicts = [dict(bar=t[0], fly=t[1]) for t in loadme]

s = Session()
s.bulk_insert_mappings(Foo, dicts)
s.commit()

Or, if you want, skip the loadme tuples and write the dictionaries directly into dicts (but I find it easier to leave all the wordiness out of the data and load up a list of dictionaries in a loop).
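A minimal sketch of that direct form, reusing the session s and the hypothetical Foo mapping (with bar and fly columns) from the example above:

# Write the dictionaries out directly instead of building them from tuples
dicts = [dict(bar=1, fly='a'), dict(bar=2, fly='b'), dict(bar=3, fly='c')]
s.bulk_insert_mappings(Foo, dicts)
s.commit()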

Piere's answer is correct, but one issue is that bulk_save_objects by default does not return the primary keys of the objects, if that is of concern to you. Set return_defaults to True to get this behavior.

The documentation is here.

foos = [Foo(bar='a',), Foo(bar='b'), Foo(bar='c')]
session.bulk_save_objects(foos, return_defaults=True)
for foo in foos:
    assert foo.id is not None
session.commit()

All roads lead to Rome, but some of them cross mountains or require ferries. If you want to get there quickly, just take the motorway.


In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it best:

The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips, the performance can be orders of magnitude better than using executemany().

In my own test, execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).
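A minimal sketch of calling execute_batch() directly with psycopg2, assuming an existing foo(bar) table; the connection string is a placeholder:

import psycopg2
from psycopg2.extras import execute_batch

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
with conn, conn.cursor() as cur:
    rows = [(i,) for i in range(100000)]
    # page_size controls how many statements are packed into each server roundtrip
    execute_batch(cur, "INSERT INTO foo (bar) VALUES (%s)", rows, page_size=1000)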

The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine().
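For example (a sketch, assuming SQLAlchemy 1.2+ with the psycopg2 driver; the database URL is a placeholder, and newer SQLAlchemy releases expose the same idea through the executemany_mode parameter instead):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:password@localhost/mydb",  # placeholder URL
    use_batch_mode=True,  # route executemany() calls through psycopg2's execute_batch()
)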

This is one way:

values = [1, 2, 3]
Foo.__table__.insert().execute([{'bar': x} for x in values])

This will insert rows like this:

INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)

Reference: The SQLAlchemy FAQ includes benchmarks for various commit methods.

The best answer I found so far was in the SQLAlchemy documentation:

http://docs.sqlalchemy.org/en/latest/faq/performance.html#im-inserting-400-000-rows-with-the-orm-and-it-s-really-slow

There is a complete example of a benchmark of possible solutions.

As shown in the documentation:

bulk_save_objects is not the best solution, but its performance is reasonable.

The second best implementation in terms of readability, I think, was with the SQLAlchemy Core:

def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )

The context of this function is given in the documentation article.

SQLAlchemy supports bulk insert:

bulk_list = [
    Foo(
        bar=1,
    ),
    Foo(
        bar=2,
    ),
    Foo(
        bar=3,
    ),
]
db.session.bulk_save_objects(bulk_list)
db.session.commit()
