
How to update millions of records in MySql?

I have two tables, tableA and tableB. tableA has 2 million records and tableB has over 10 million. tableA has more than thirty columns, whereas tableB has only two. I need to update a column in tableA from tableB by joining the two tables.

UPDATE tableA a 
INNER JOIN tableB b  ON a.colA=b.colA
 SET a.colB= b.colB 

colA has been indexed in both tables.

Now when I execute the query it takes hours. Honestly, I have never seen it complete; the longest I have waited is 5 hours. Is there any way to complete this query within 20-30 minutes? What approach should I take?

EXPLAIN output for the query:

"id" "select_type" "table" "type" "possible_keys" "key"        "key_len" "ref"        "rows"    "Extra"
"1"  "SIMPLE"      "a"     "ALL"  "INDX_DESC"     \N           \N        \N           "2392270" "Using where"
"1"  "SIMPLE"      "b"     "ref"  "indx_desc"     "indx_desc"  "133"     "cis.a.desc" "1"       "Using where"

Your UPDATE operation is performing a single transaction on millions of rows of a large table. (The DBMS holds enough data to roll back the entire UPDATE query if it does not complete for any reason.) A transaction of that size is slow for your server to handle.

When you process entire tables, the operation can't use indexes as effectively as it can when it has highly selective WHERE clauses.

A few things to try:

1) Don't update rows unless they need it. Skip the rows that already have the correct value. If most rows already have the correct value, this will make your update much faster.

    UPDATE tableA a 
INNER JOIN tableB b  ON a.colA=b.colA
       SET a.colB = b.colB
     WHERE a.colB <> b.colB 

2) Do the update in chunks of a few thousand rows, and repeat the update operation until the whole table is updated. I guess tableA contains an id column. You can use it to organize the chunks of rows to update.

    UPDATE tableA a 
    INNER JOIN tableB b  ON a.colA = b.colA
       SET a.colB = b.colB
     WHERE a.id IN (
             SELECT id
               FROM ( SELECT a2.id
                        FROM tableA a2
                       INNER JOIN tableB b2 ON a2.colA = b2.colA
                       WHERE a2.colB <> b2.colB
                       LIMIT 5000 ) AS chunk  -- the extra derived table avoids
      )                                       -- MySQL error 1093 (can't update a
                                              -- table that is read in the subquery)

The subquery finds the id values of 5000 rows that haven't yet been updated, and the UPDATE query updates them. Repeat this query until it changes no rows, and you're done. This makes things faster because the server only has to handle smaller transactions.

3) Don't do the update at all. Instead, whenever you need to retrieve your colB value, simply join to tableB in your select query.
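That read-time join might look something like this (a sketch using the column names from the question; LEFT JOIN keeps tableA rows that have no match in tableB):

    -- Fetch colB on demand instead of materializing it in tableA.
    SELECT a.*, b.colB
      FROM tableA a
      LEFT JOIN tableB b ON a.colA = b.colA;

With colA indexed in tableB, each lookup is cheap, and you avoid the huge write transaction entirely.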

Chunking is the right way to go. However, chunk on the PRIMARY KEY of tableA.

I suggest only 1000 rows at a time.

Follow the tips given here.

Did you say that the PK of tableA is a VARCHAR? No problem. See the second flavor of code in that link; it uses ORDER BY id LIMIT 1000,1 to find the end of the next chunk, regardless of the datatype of id (the PK).
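A sketch of that chunk walk, assuming the PK column is named id and @left starts below the smallest id; repeat the three statements until @right comes back NULL, then run one final pass with no upper bound:

    SET @right = NULL;
    SELECT id INTO @right
      FROM tableA
     WHERE id >= @left
     ORDER BY id
     LIMIT 1000, 1;        -- first id *beyond* the next 1000-row chunk

    UPDATE tableA a
     INNER JOIN tableB b ON a.colA = b.colA
       SET a.colB = b.colB
     WHERE a.id >= @left
       AND a.id < @right;

    SET @left = @right;    -- advance to the next chunk and repeat

Because each chunk is bounded by PK ranges rather than OFFSET, every iteration is an index range scan and stays fast no matter how far into the table you are.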

For updating around 70 million records of a single MySQL table, I wrote a stored procedure to update the table in chunks of 5000. It took approximately 3 hours to complete.

DELIMITER $$
DROP PROCEDURE IF EXISTS update_multiple_example_proc$$
CREATE PROCEDURE update_multiple_example_proc()
BEGIN
  DECLARE x BIGINT;

  SET x = 1;

  WHILE x <= <MAX_PRIMARY_KEY_TO_REACH> DO
    UPDATE tableA A
      JOIN tableB B ON A.col1 = B.col1
       SET A.col2_to_be_updated = B.col2_to_be_updated
     WHERE A.id BETWEEN x AND x + 4999;  -- BETWEEN is inclusive, so step by the
    SET x = x + 5000;                    -- chunk size to avoid re-updating boundary rows
  END WHILE;

END$$
DELIMITER ;

Hi, I am not sure, but you could do it with a cron job. Process: add one more column to tableA, for example is_update, with a default value of 0, and set the cron job to run every minute. While the cron is working: the first run picks 10000 records that have is_update = 0, updates them, and sets is_update to 1; the second run picks the next 10000 with is_update = 0, and so on. Hope this helps.
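Sketched in SQL (the is_update flag and the 10000 batch size come from the suggestion above; note MySQL allows LIMIT only on single-table UPDATEs, so the join is expressed as a correlated subquery here):

    -- One-time setup: a progress flag, defaulting to 0 (not yet updated).
    ALTER TABLE tableA ADD COLUMN is_update TINYINT NOT NULL DEFAULT 0;

    -- Run on each cron tick until it matches no more rows.
    -- Caveat: rows with no match in tableB would get colB = NULL;
    -- add an EXISTS check against tableB if that matters.
    UPDATE tableA
       SET colB = (SELECT b.colB FROM tableB b WHERE b.colA = tableA.colA),
           is_update = 1
     WHERE is_update = 0
     LIMIT 10000;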

Look at the oak-chunk-update tool. It is one of the best tools if you want to update a billion rows, too ;)
