
MySQL: Best way to update a large table

I have a table with a huge amount of data. The source of the data is an external api. Every few hours, I need to sync the database so that it is up to date with the external api. I am doing a full sync (the api doesn't allow delta sync).

While the sync happens, I want to make sure that the data in the database is still available for reads. So, I am following the steps below:

  1. I have a column in the table that acts as a flag for whether or not the data is readable. Only rows with the flag set are served to readers.
  2. I insert all the data from the api into the table.
  3. Once all the data is written, I delete all rows in the table that have the flag set.
  4. After the deletion, I update the table and set the flag on all remaining rows.

The table has ~50 million rows and is expected to grow. There is a customerId field in the table, and a sync usually covers one customer: the customerId is passed to the api.

My problem is that steps 3 and 4 above take a lot of time. The queries look like:

Step 3 --> delete from foo where customer_id=12345678 and flag=1

Step 4 --> update foo set flag=1 where customer_id=12345678

I have tried partitioning the table on customer_id. It works well for customers with few rows, but for some customers a single partition still holds ~5 million rows.

Around 90% of data doesn't change between two syncs. How can I make this fast?

I was thinking of issuing update queries instead of insert queries and checking whether each update matched a row. If it didn't, I could issue an insert for that row. That way, updates would be handled along with inserts. But I am not sure whether this would block read queries while the update is in progress.
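For reference, MySQL can do that "update, else insert" in a single statement with INSERT ... ON DUPLICATE KEY UPDATE, so there is no need for a separate check. A minimal sketch of the pattern using Python's sqlite3 module, whose ON CONFLICT ... DO UPDATE clause plays the same role (the table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, customer_id INTEGER, data TEXT)")
conn.execute("INSERT INTO foo VALUES (1, 12345678, 'old')")

# Upsert: update the row if the primary key already exists, insert otherwise.
# MySQL equivalent: INSERT ... ON DUPLICATE KEY UPDATE data = VALUES(data)
rows = [(1, 12345678, "new"), (2, 12345678, "fresh")]
conn.executemany(
    "INSERT INTO foo (id, customer_id, data) VALUES (?, ?, ?) "
    "ON CONFLICT(id) DO UPDATE SET data = excluded.data",
    rows,
)
conn.commit()

result = dict(conn.execute("SELECT id, data FROM foo").fetchall())
print(result)  # {1: 'new', 2: 'fresh'}
```

Note that an upsert like this still rewrites every row on every sync, so it doesn't by itself exploit the fact that ~90% of the data is unchanged.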

For your setup (read-only data, full sync), the fastest way to update the table is not to update it at all, but to import the data into a different table and rename it afterwards to make it the new table.

Create a table like your original table, e.g. use

create table foo_import like foo;

If you have, e.g., triggers, add them too (CREATE TABLE ... LIKE does not copy them).

From now on, let the import api write its (full) sync to this new table.

After a sync is done, swap the two tables:

RENAME TABLE foo TO foo_tmp, 
    foo_import TO foo, 
    foo_tmp to foo_import;

It will (literally) just require a second.

This command is atomic: it waits for transactions that access these tables to finish, it never presents a moment in which there is no table foo, and it fails completely (doing nothing) if one of the tables doesn't exist or foo_tmp already exists.

As a final step, empty your import table (that now contains your old data) to be ready for your next import:

truncate foo_import;

This will again just require a second.
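The whole load-swap-truncate cycle can be sketched end to end. This sketch uses Python's sqlite3 module purely for illustration; SQLite has no multi-table RENAME TABLE, so the three renames simply run back to back here, whereas in MySQL the single RENAME TABLE statement does all three atomically. The table names match the ones above; the data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("INSERT INTO foo VALUES (1, 'stale')")

# Build the import table and load the fresh full sync into it.
conn.execute("CREATE TABLE foo_import (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("INSERT INTO foo_import VALUES (1, 'fresh')")

# Three-way swap. MySQL does this in one atomic statement:
#   RENAME TABLE foo TO foo_tmp, foo_import TO foo, foo_tmp TO foo_import;
conn.execute("ALTER TABLE foo RENAME TO foo_tmp")
conn.execute("ALTER TABLE foo_import RENAME TO foo")
conn.execute("ALTER TABLE foo_tmp RENAME TO foo_import")

# Readers now see the fresh data; empty the old copy for the next sync.
conn.execute("DELETE FROM foo_import")  # MySQL: TRUNCATE foo_import
conn.commit()

live = conn.execute("SELECT data FROM foo").fetchone()[0]
print(live)  # fresh
```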

The rest of your queries probably assume that flag=1. Until (if at all) you update the code to drop the flag, you can set its default value to 1 to stay compatible, e.g. use

alter table foo modify column flag tinyint default 1;

Since you don't have foreign keys, this doesn't affect you, but for others with a similar problem it may be useful to know that foreign keys get adjusted by the rename: foreign keys that referenced foo will reference foo_import after the tables are renamed. To make them point to the new foo again, they have to be dropped and recreated. Everything else (e.g. views, queries, procedures) resolves by the current name, so it will always access the current foo.

CREATE TABLE `new` LIKE `real`;
Load `new` by whatever means you have; take as long as needed.
RENAME TABLE `real` TO `old`, `new` TO `real`;
DROP TABLE `old`;

The RENAME is atomic and "instantaneous"; real is "always" available.

(I don't see the need for flag .)

OR...

Since you are actually updating a chunk of a table, consider these...

If the chunk is small...

  1. Load the new data into a tmp table
  2. DELETE the old rows
  3. INSERT ... SELECT ... to move the new rows in. (Having the new data already in a table is probably the fastest way to achieve this.)
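The three steps above can be sketched with Python's sqlite3 module (the table names, customer_id value, and data are all illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, customer_id INTEGER, data TEXT)")
conn.executemany("INSERT INTO foo VALUES (?, ?, ?)",
                 [(1, 42, "old-a"), (2, 42, "old-b"), (3, 7, "keep")])

# 1. Load the new data for this customer into a staging table.
conn.execute("CREATE TEMP TABLE tmp_sync (id INTEGER PRIMARY KEY, customer_id INTEGER, data TEXT)")
conn.executemany("INSERT INTO tmp_sync VALUES (?, ?, ?)",
                 [(1, 42, "new-a"), (4, 42, "new-c")])

with conn:  # one transaction, so readers never see a half-replaced chunk
    # 2. Delete the customer's old rows.
    conn.execute("DELETE FROM foo WHERE customer_id = ?", (42,))
    # 3. Move the fresh rows in from the staging table.
    conn.execute("INSERT INTO foo SELECT * FROM tmp_sync")

rows = sorted(conn.execute("SELECT id, data FROM foo").fetchall())
print(rows)  # [(1, 'new-a'), (3, 'keep'), (4, 'new-c')]
```

Other customers' rows (id 3 here) are untouched, which is the point of chunking by customer_id.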

If the chunk is big, and you don't want to lock the table for "too long", there are some other tricks. But first, is there some form of unique row number for each row for the customer? (I'm thinking about batch-moving a bunch of rows at a time, but I need more specifics before spelling it out.)
