
SQLAlchemy Core bulk insert slow

I'm trying to truncate a table and insert only ~3000 rows of data using SQLAlchemy, and it's very slow (~10 minutes).

I followed the recommendations in this doc and leveraged SQLAlchemy Core to do my inserts, but it's still running very, very slowly. What are possible culprits for me to look at? The database is a Postgres RDS instance. Thanks!

engine = sa.create_engine(db_string, **kwargs, pool_recycle=3600)
with engine.begin() as conn:
    conn.execute("TRUNCATE my_table")
    conn.execute(
        MyTable.__table__.insert(),
        data  # data is a list of dicts
    )
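As a quick diagnostic (not part of the original question; a sketch reusing the engine and table names above), SQLAlchemy's echo flag logs every statement as it is emitted, which helps confirm how the list of dicts is actually being sent to the server:

# Diagnostic sketch only (not from the original post): echo=True makes
# SQLAlchemy log every statement and its parameters as they are emitted.
engine = sa.create_engine(db_string, echo=True, pool_recycle=3600)
with engine.begin() as conn:
    conn.execute("TRUNCATE my_table")
    # A list of dicts is executed via the DBAPI's executemany; with psycopg2's
    # defaults that is effectively one INSERT round trip per row, which is a
    # common culprit over a high-latency link to RDS.
    conn.execute(MyTable.__table__.insert(), data)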

I was bummed when I saw this didn't have an answer... I ran into the exact same problem the other day: trying to bulk-insert millions of rows into a Postgres RDS instance using Core. It was taking hours.

As a workaround, I ended up writing my own bulk-insert script that generated the raw SQL itself:

# Build a single multi-row VALUES clause rather than inserting one row at a time
bulk_insert_str = []
for entry in entry_list:
    val_str = "('{}', '{}', ...)".format(entry["column1"], entry["column2"], ...)
    bulk_insert_str.append(val_str)

engine.execute(
    """
    INSERT INTO my_table (column1, column2 ...)
    VALUES {}
    """.format(",".join(bulk_insert_str))
)

While ugly, this gave me the performance we needed (~500,000 rows/minute).

Did you find a CORE-based solution? If not, hope this helps!

UPDATE: I ended up moving my old script onto a spare EC2 instance that we weren't using, which actually fixed the slow performance issue. Not sure what your setup is, but apparently there's network overhead when communicating with RDS from an external (non-AWS) connection.
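For reference, here is a minimal sketch of the same multi-row VALUES technique, but parameterized through psycopg2's execute_values so the driver handles quoting instead of manual string formatting. The table and column names reuse the example above; psycopg2 as the DBAPI and the page_size value are assumptions:

from psycopg2.extras import execute_values

# Sketch only: same multi-row VALUES idea as above, but parameterized so the
# driver escapes every value instead of relying on str.format().
raw_conn = engine.raw_connection()  # unwrap the DBAPI connection from the SQLAlchemy engine
try:
    with raw_conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO my_table (column1, column2) VALUES %s",
            [(entry["column1"], entry["column2"]) for entry in entry_list],
            page_size=1000,  # assumed batch size: rows folded into each INSERT
        )
    raw_conn.commit()
finally:
    raw_conn.close()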

Some time ago I struggled with this problem while working at my company, so we created a library with functions for bulk insert and update. I hope we've taken all performance and security concerns into account. The library is open source and available on PyPI; its name is bulky.

Let me show you some examples of usage:

insert:

import bulky
from random import random  # used to generate the sample values below

from your.sqlalchemy.models import Model
from your.sqlalchemy.session import Session

data = [
    {Model.column_float: random()}
    for _ in range(100_000_000)
]

rows_inserted = bulky.insert(
    session=Session,
    table_or_model=Model,
    values_series=data,
    returning=[Model.id, Model.column_float],
)

new_items = {row.id: row.column_float for row in rows_inserted}

update:

import bulky
from your.sqlalchemy.models import ManyToManyTable
from your.sqlalchemy.session import Session

data = [
    {
        ManyToManyTable.fk1: i,
        ManyToManyTable.fk2: j,
        ManyToManyTable.value: i + j,
    }
    for i in range(100_000_000)
    for j in range(100_000_000)
]

rows_updated = bulky.update(
    session=Session,
    table_or_model=ManyToManyTable,
    values_series=data,
    returning=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
        ManyToManyTable.value,
    ],
    reference=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
    ],
)

updated_items = {(row.fk1, row.fk2): row.value for row in rows_updated}

Not sure if links are allowed, so I'll put them under a spoiler:

Readme and PyPI
