
What is the best way to fetch huge data from mysql with sqlalchemy?

I want to process over 10 million rows stored in MySQL, so I wrote the function below to slice the SQL into several parts and then concatenate the data for later processing. It works well when count < 2 million, but as count grows, the time SQLAlchemy takes gets much longer.

import gc

from sqlalchemy import create_engine


def fetch_from_sql(_sql_pat, count):
    """
    :param _sql_pat: SELECT id, data FROM a.b LIMIT {},{};
    :param count: how many rows to fetch from MySQL
    :return: generator
    """
    def gen_connect(sql):
        # open a connection per slice and stream its rows
        __engine = create_engine(db_config['SQLALCHEMY_DATABASE_URI'])
        with __engine.connect() as c:
            for row in c.execute(sql):
                yield row

    def gen_range(limit, step):
        # yield (offset, row_count) pairs covering [0, limit]
        if step > limit:
            yield 0, limit
        else:
            R = range(0, limit + 1, step)
            for idx, v in enumerate(R):
                if idx == 0:
                    yield v, step
                elif limit - v >= step:
                    yield v + 1, step
                else:
                    yield v + 1, limit - v

    sqls = [_sql_pat.format(start, step) for start, step in gen_range(count, 100000)]
    sources = (gen_connect(sql) for sql in sqls)
    for s in sources:
        for item in s:
            yield item
        gc.collect()
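
For context, a minimal sketch of how this generator might be driven to produce the progress log further below; the SQL pattern is taken from the docstring, but the logging loop itself is an assumption, not part of the original code:

from datetime import datetime

sql_pat = "SELECT id, data FROM a.b LIMIT {},{};"
dumped = 0
for row in fetch_from_sql(sql_pat, count=10000000):
    dumped += 1  # replace with real processing of (id, data)
    if dumped % 1000000 == 0:
        print("Dumped {} items, at {}".format(dumped, datetime.now()))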

The question is why SQLAlchemy takes more and more time (I logged the times and post them below), and what is the best way to deal with this situation?

Dumped 10000 items, at 2016-10-08 11:55:33
Dumped 1000000 items, at 2016-10-08 11:59:23
Dumped 2000000 items, at 2016-10-08 12:05:07
Dumped 3000000 items, at 2016-10-08 13:54:05

This is because you're using LIMIT / OFFSET: when you specify offset 3000000, for example, the database has to scan past and discard 3,000,000 records before it can return the rows you asked for, so each successive batch gets slower.

The correct way to do this is to ORDER BY an indexed column, such as the primary key id column, and then filter with WHERE id > :last_fetched_id.
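
A hedged sketch of that keyset-pagination approach, reusing the table and column names from the question (the function name, connection URI, and batch size are placeholders to adapt to your setup):

from sqlalchemy import create_engine, text

def fetch_by_keyset(uri, batch_size=100000):
    """Yield rows in primary-key order, batch_size rows per round trip."""
    engine = create_engine(uri)
    query = text(
        "SELECT id, data FROM a.b "
        "WHERE id > :last_id ORDER BY id LIMIT :batch"
    )
    last_id = 0
    with engine.connect() as conn:
        while True:
            rows = conn.execute(
                query, {"last_id": last_id, "batch": batch_size}
            ).fetchall()
            if not rows:
                break
            for row in rows:
                yield row
            # resume from the last primary key seen instead of an OFFSET
            last_id = rows[-1][0]

Because each batch starts from an index seek on id rather than scanning and discarding offset rows, the per-batch cost stays roughly constant no matter how deep into the table you are.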
