
Purging data from mysql tables

I have a cron set up to take a backup of production MySQL tables, and I'm looking to purge data from the tables at regular intervals. I have to delete data across multiple tables referenced by IDs.

Some background: I need to delete about 2 million rows, and my app will be continuously reading from and writing to the database (though it shouldn't usually access the rows being deleted).

My question is how I should structure my delete query with respect to the following:

  1. Should I delete in a single bulk query, or in batches?
  2. Should I delete across different tables in a single transaction, or without any transaction? Will there be table-level locks if I run the deletes inside transactions, even if I delete in batches?
  3. I do not have any partitions set up; would fragmentation be an issue?

Assumptions:

  1. Isolation level: REPEATABLE READ, MySQL's default.
  2. Your delete query is based on a range, not the primary key.

  3. Deleting all rows in one transaction means a very long transaction and larger locks. This will increase replication lag; replication lag is bad, and a new DC makes it really bad. Holding larger locks will also reduce your write throughput. (Under the SERIALIZABLE isolation level, even read throughput might suffer.)

  4. Deleting in batches. This is better than deleting everything at once, but since the deletes target a range, each delete will take more locks (gap locks and next-key locks). So batched range deletes have the same problems, just smaller.

Between deleting everything in one go and deleting in batches, batches are preferable.
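The batched approach can be sketched as a loop that deletes a bounded number of rows per transaction until nothing matches. This is a minimal sketch using Python's `sqlite3` as a stand-in for MySQL (the table name `events`, the `purge_time` column, and the batch size are illustrative assumptions); on MySQL you would typically write `DELETE ... WHERE ... LIMIT n` directly instead of the `IN (SELECT ...)` form used here for SQLite compatibility:

```python
import sqlite3

def purge_in_batches(conn, cutoff, batch_size=1000):
    """Delete expired rows in small batches so each transaction
    stays short and locks are held only briefly."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events WHERE purge_time < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()          # commit per batch: keeps transactions short
        if cur.rowcount == 0:  # nothing left to purge
            break
        total += cur.rowcount
    return total

# demo with an in-memory database (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, purge_time INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, i) for i in range(5000)])
deleted = purge_in_batches(conn, cutoff=2500, batch_size=1000)
print(deleted)  # -> 2500
```

Committing after every batch is the point: each transaction touches at most `batch_size` rows, so locks are released quickly and replication receives small, steady chunks instead of one huge event.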

Another way of doing it (we need to delete rows older than some cutoff time):

  1. Have a daemon which runs every configured interval and:
     i. `SELECT pk FROM table WHERE purge_time < your_purge_time` -- no locks.
     ii. Delete based on pk, using multiple threads -- row-level locks, small transactions (across tables).

This approach ensures smaller transactions and only row-level locks (a delete based on the primary key takes only row-level locks). Also, the query is simple, so you can rerun it even when only part of the deletes succeeded. And I don't think making these deletes atomic is a requirement.
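The two-step select-then-delete-by-PK idea might look like the following sketch (again using `sqlite3` as a stand-in, with the same hypothetical `events` table; it runs single-threaded here, but in production each worker thread would take a slice of the collected PKs with its own connection):

```python
import sqlite3

BATCH = 500  # chunk size per DELETE; an illustrative assumption

def purge_by_pk(conn, cutoff):
    # step i: collect primary keys first -- a plain SELECT takes no locks
    pks = [row[0] for row in conn.execute(
        "SELECT id FROM events WHERE purge_time < ?", (cutoff,))]
    # step ii: delete strictly by primary key in small chunks;
    # PK deletes take only row-level locks in InnoDB, and the loop
    # can be restarted safely if it fails partway through
    for i in range(0, len(pks), BATCH):
        chunk = pks[i:i + BATCH]
        placeholders = ",".join("?" * len(chunk))
        conn.execute(
            f"DELETE FROM events WHERE id IN ({placeholders})", chunk)
        conn.commit()  # one short transaction per chunk
    return len(pks)

# demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, purge_time INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, i) for i in range(2000)])
removed = purge_by_pk(conn, cutoff=1200)
print(removed)  # -> 1200
```

Because the SELECT and the DELETEs are decoupled, a crash between chunks leaves the data consistent: rerunning the daemon simply re-selects whatever expired PKs remain.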

Or

  1. Reduce your isolation level to READ COMMITTED; then even with batch deletes you should be fine. Under READ COMMITTED, locks are taken only on the matching rows, even when accessing them via a secondary key.
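In MySQL the isolation level can be lowered for just the purge session, so the rest of the application keeps its default. A minimal fragment, assuming the batched DELETE from above:

```sql
-- applies only to this connection, not the whole server
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

DELETE FROM events WHERE purge_time < @cutoff LIMIT 1000;
```

Running the purge connection at READ COMMITTED avoids the gap/next-key locks that REPEATABLE READ takes on range scans.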

Or

  1. If your model allows it, shard based on time and just drop the expired database itself :)


 