
Purging a huge data MySQL table using Python

I have a table with roughly 1000M rows and I need an automated script that keeps only the last 7 days of data and deletes everything older. I want to do it in Python, deleting chunk-wise. I have two doubts: 1. Is there any Python library for MySQL that supports this kind of chunked deletion? 2. If not, can anyone suggest the best way to apply chunked deletes with MySQL?

I'm unaware of a Python package that has an API for "chunking" deletes from a MySQL table. SQLAlchemy provides a fluent interface that can do this, but it's not much different from the raw SQL. I suggest using PyMySQL.
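For comparison, here is a minimal SQLAlchemy sketch of the same idea using text(), which shows why it is not much different from the raw SQL. The connection URL, table and column names are assumptions, and it assumes SQLAlchemy 2.0-style connections over the pymysql driver:

import datetime

from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@host/database')
cutoff = datetime.datetime.now() - datetime.timedelta(days=7)
# LIMIT keeps each DELETE bounded; parameters are bound client-side by pymysql.
stmt = text('DELETE FROM `mytable` WHERE `timestamp` < :cutoff ORDER BY `id` LIMIT :chunk')
with engine.connect() as conn:
    while True:
        result = conn.execute(stmt, {'cutoff': cutoff, 'chunk': 1000})
        conn.commit()  # commit each chunk (SQLAlchemy 2.0-style explicit commit)
        if result.rowcount == 0:
            break

The plain PyMySQL version below does the same thing without the extra dependency: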

import datetime

import pymysql.cursors


connection = pymysql.connect(
    host='host',
    user='user',
    password='password',
    database='database'
)
# Everything older than this cutoff gets deleted.
seven_days_before_now = datetime.datetime.now() - datetime.timedelta(days=7)
chunksize = 1000
with connection.cursor() as cursor:
    sql = 'DELETE FROM `mytable` WHERE `timestamp` < %s ORDER BY `id` LIMIT %s;'
    num_deleted = None
    while num_deleted != 0:
        # execute() returns the number of affected rows, so the loop stops
        # once no rows older than the cutoff remain.
        num_deleted = cursor.execute(sql, (seven_days_before_now, chunksize))
        # Commit after every chunk so no single transaction spans the whole table.
        connection.commit()

The LIMIT just limits the number of deleted rows to the chunksize. The ORDER BY ensures that the DELETE is deterministic, and it sorts by the primary key because the primary key is guaranteed to be indexed; so even though it sorts for each chunk, at least it's sorting on an indexed column. Remove the ORDER BY if deterministic behavior is unnecessary; this will result in faster execution time. You'll need to replace the connection details, table name, column name and chunksize. Also, this solution assumes that the table has a column named id which is the primary key and an auto-incrementing integer. You'll need to make some changes if your schema differs.
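If the repeated ORDER BY ... LIMIT scans turn out to be slow, a variant (not from the original answer, just a sketch under the same assumptions, reusing connection, seven_days_before_now and chunksize from the snippet above) is to walk the auto-incrementing id in fixed ranges, so each DELETE only touches one narrow id window:

with connection.cursor() as cursor:
    # Id range that still falls inside the purge window; an index on
    # `timestamp` helps this initial scan.
    cursor.execute(
        'SELECT MIN(`id`), MAX(`id`) FROM `mytable` WHERE `timestamp` < %s',
        (seven_days_before_now,)
    )
    min_id, max_id = cursor.fetchone()
    if min_id is not None:
        for lower in range(min_id, max_id + 1, chunksize):
            # Keep the timestamp predicate in case ids are not strictly
            # ordered by time.
            cursor.execute(
                'DELETE FROM `mytable` '
                'WHERE `id` BETWEEN %s AND %s AND `timestamp` < %s',
                (lower, lower + chunksize - 1, seven_days_before_now)
            )
            connection.commit()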

As Bernd Buffen commented: the correct way to get the behavior you desire is to partition the table. Please consider a migration to do so.
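A rough sketch of what that migration could look like follows; it is not part of the original answer. It assumes `timestamp` is a DATETIME column (for a TIMESTAMP type, MySQL only allows UNIX_TIMESTAMP() as the RANGE expression), that the primary key includes the partitioning column (MySQL requires every unique key on a partitioned table to contain it), and the partition names and dates are purely illustrative:

# Repartitioning a 1000M-row table is an expensive one-off rebuild, so run it
# in a maintenance window.
partition_ddl = '''
ALTER TABLE `mytable`
PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
    PARTITION p20240101 VALUES LESS THAN (TO_DAYS('2024-01-02')),
    PARTITION p20240102 VALUES LESS THAN (TO_DAYS('2024-01-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
)
'''

with connection.cursor() as cursor:
    cursor.execute(partition_ddl)
    # Purging a day is then a near-instant metadata operation instead of
    # millions of row deletes.
    cursor.execute('ALTER TABLE `mytable` DROP PARTITION p20240101')

New daily partitions still have to be added (or pmax reorganized) on a schedule, so a daily job would both add tomorrow's partition and drop the oldest one.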

And, of course: stop using Python 2; it's been unsupported for almost two years as of the first version of this answer.
