
Migrating data from a ~90 million record MySQL table to another database

In the past week I've been trying to migrate a database containing approximately 90 million rows from MySQL to a newly created Couchbase instance. I've searched the web for possible solutions and found some tools which ultimately failed due to low memory availability. I also read about partitioning, but I'm no expert in MySQL administration, so that seemed like an over-reach for my abilities at the moment. Eventually I decided to implement my own dedicated script, which selects a certain amount of data from the existing MySQL table, serializes it for Couchbase's newly created bucket, and inserts it there. The tool works great for the first 5 million records, but then the MySQL instance takes way too long to retrieve further records.

It is worth mentioning that the MySQL table I'm working with is only being used by me, so no changes are made to it during the migration process.

The script I built leverages the LIMIT OFFSET clause as described in the Select Syntax Documentation and looks like this:

SELECT * FROM data LIMIT ?,?

Where ?,? is generated by advancing the starting point of the selection by a fixed number of records each time. For example, the following are possible queries issued by a single migration run:

SELECT * FROM data LIMIT 0,100000
SELECT * FROM data LIMIT 100000,200000
SELECT * FROM data LIMIT 200000,300000
...

The migration process stops when no records are retrieved. As I previously stated, the queries that select records starting from a position of about 5 million take too long and make the migration impractical. I'm no database expert and have done nothing other than creating a new MySQL database and tables via MySQL Workbench 6.3 CE, and no optimizations have been made on my data. The table I'm trying to migrate contains one column which acts as a key, is non-null, and has a unique value. All other columns have no options enabled on them.

I would like to know if there is any other way for me to select the data sequentially so it could be inserted without duplicates or corruption. Any help on this matter is greatly appreciated!

You are doing the pagination incorrectly. See Using MySQL LIMIT to Constrain The Number of Rows Returned By SELECT Statement

The following illustrates the LIMIT clause syntax with two arguments:

SELECT 
    column1,column2,...
FROM
    table
LIMIT offset , count;
  • The offset specifies the offset of the first row to return. The offset of the first row is 0, not 1.
  • The count specifies the maximum number of rows to return.

So you should use a fixed page size (count) and a variable offset with no overlapping, as in the queries below; a sketch of the corresponding loop follows them.

SELECT * FROM data LIMIT 0,100000
SELECT * FROM data LIMIT 100000,100000
SELECT * FROM data LIMIT 200000,100000
....
SELECT * FROM data LIMIT 89900000,100000
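
A minimal sketch of that paging loop, assuming the migration script is written in Python with mysql-connector-python; the connection settings are placeholders and the Couchbase insert is left as a stub:

import mysql.connector

PAGE_SIZE = 100000  # fixed count per page

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cursor = conn.cursor()

offset = 0
while True:
    cursor.execute("SELECT * FROM data LIMIT %s, %s", (offset, PAGE_SIZE))
    rows = cursor.fetchall()
    if not rows:
        break                      # no rows left: migration finished
    for row in rows:
        pass                       # serialize and insert into Couchbase here
    offset += PAGE_SIZE            # advance by the page size, never by a multiple of it

cursor.close()
conn.close()

Note that this only fixes the overlapping pages; large offsets can still be slow, as discussed below.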

I guess MySQL starts taking an unusably long time to satisfy your LIMIT clauses when their numbers get larger. LIMIT does that: to honor a large offset the server still has to read and discard every row before the offset, so each successive page gets slower.

You'll have much better luck using an indexed column to select each segment of your table to export. There's no harm done if some segments contain fewer rows than others.

For example, you could do

SELECT * FROM data WHERE datestamp >= '2017-01-01' AND datestamp < '2017-02-01';
SELECT * FROM data WHERE datestamp >= '2017-02-01' AND datestamp < '2017-03-01';
SELECT * FROM data WHERE datestamp >= '2017-03-01' AND datestamp < '2017-04-01';
SELECT * FROM data WHERE datestamp >= '2017-04-01' AND datestamp < '2017-05-01';
SELECT * FROM data WHERE datestamp >= '2017-05-01' AND datestamp < '2017-06-01';
SELECT * FROM data WHERE datestamp >= '2017-06-01' AND datestamp < '2017-07-01';
 ...

to break out your records by calendar month (assuming you have a datestamp column).

Or, if you have an autoincrementing primary key id column, try this (a loop generating these ranges is sketched below):

SELECT * FROM data WHERE                  id < 100000;
SELECT * FROM data WHERE id >= 100000 AND id < 200000;
SELECT * FROM data WHERE id >= 200000 AND id < 300000;
SELECT * FROM data WHERE id >= 300000 AND id < 400000;
SELECT * FROM data WHERE id >= 400000 AND id < 500000;
SELECT * FROM data WHERE id >= 500000 AND id < 600000;
 ...
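
A sketch of that loop under the assumption that the key column is an auto-incrementing integer named id (mysql-connector-python again, placeholder connection settings, Couchbase insert left as a stub):

import mysql.connector

STEP = 100000  # id range covered by each query

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cursor = conn.cursor()

cursor.execute("SELECT MIN(id), MAX(id) FROM data")
min_id, max_id = cursor.fetchone()

lower = min_id
while lower <= max_id:
    upper = lower + STEP
    cursor.execute("SELECT * FROM data WHERE id >= %s AND id < %s",
                   (lower, upper))
    for row in cursor:             # some ranges may hold fewer rows; that's fine
        pass                       # serialize and insert into Couchbase here
    lower = upper

cursor.close()
conn.close()

Because each query walks the primary key index, the cost of a range does not grow as you move further into the table, unlike LIMIT with a large offset.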

An entirely different approach will also work. In your dumping program, do

 SELECT * FROM data;

then have the program switch to a new output file every n records. For example, in pseudocode:

 rowcount = 100000              # records per output file
 rownum = 0                     # total records written so far
 rowsleft = rowcount            # records remaining for the current file
 open file 'out' + 000000
 while next input record available {
     read record
     write record
     rownum = rownum + 1
     rowsleft = rowsleft - 1
     if rowsleft <= 0 {         # current file is full
        close file
        open file 'out' + rownum
        rowsleft = rowcount
     }
 }
 close file

This will use a single MySQL query, so you won't have to worry about segments. It should be quite fast.
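
A runnable sketch of the same idea in Python, assuming mysql-connector-python with an unbuffered cursor (so the 90 million rows are fetched as you iterate instead of being loaded into memory at once); the per-row serialization is a placeholder:

import mysql.connector

ROWS_PER_FILE = 100000

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cursor = conn.cursor(buffered=False)   # stream rows rather than buffering the result
cursor.execute("SELECT * FROM data")

rownum = 0
out = open("out000000", "w")
for row in cursor:
    out.write(repr(row) + "\n")        # replace with your real serialization
    rownum += 1
    if rownum % ROWS_PER_FILE == 0:
        out.close()                    # start a new file every ROWS_PER_FILE rows
        out = open("out%06d" % rownum, "w")
out.close()

cursor.close()
conn.close()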
