Super slow query with CROSS JOIN
I have two tables named table_1 (1 GB) and reference (250 MB).
When I run an UPDATE with a cross join on reference, it takes 16 hours to update table_1. We switched the file system from EXT3 to XFS, but it still takes 16 hours. What am I doing wrong?
Here is the update/cross join query:
mysql> UPDATE table_1 CROSS JOIN reference ON
-> (table_1.start >= reference.txStart AND table_1.end <= reference.txEnd)
-> SET table_1.name = reference.name;
Query OK, 17311434 rows affected (16 hours 36 min 48.62 sec)
Rows matched: 17311434 Changed: 17311434 Warnings: 0
Here is the SHOW CREATE TABLE output for table_1 and reference:
CREATE TABLE `table_1` (
`strand` char(1) DEFAULT NULL,
`chr` varchar(10) DEFAULT NULL,
`start` int(11) DEFAULT NULL,
`end` int(11) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`name2` varchar(255) DEFAULT NULL,
KEY `annot` (`start`,`end`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
CREATE TABLE `reference` (
`bin` smallint(5) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`chrom` varchar(255) NOT NULL,
`strand` char(1) NOT NULL,
`txStart` int(10) unsigned NOT NULL,
`txEnd` int(10) unsigned NOT NULL,
`cdsStart` int(10) unsigned NOT NULL,
`cdsEnd` int(10) unsigned NOT NULL,
`exonCount` int(10) unsigned NOT NULL,
`exonStarts` longblob NOT NULL,
`exonEnds` longblob NOT NULL,
`score` int(11) DEFAULT NULL,
`name2` varchar(255) NOT NULL,
`cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`exonFrames` longblob NOT NULL,
KEY `chrom` (`chrom`,`bin`),
KEY `name` (`name`),
KEY `name2` (`name2`),
KEY `annot` (`txStart`,`txEnd`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;
You should index the table_1.start, reference.txStart, table_1.end and reference.txEnd table fields:
ALTER TABLE `table_1` ADD INDEX ( `start` ) ;
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txStart` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;
Cross joins are Cartesian products, which are among the most computationally expensive things to compute (they don't scale well).
For each table T_i for i = 1 to n, the number of rows generated by crossing tables T_1 to T_n is the size of each table multiplied by the size of each other table, i.e.
|T_1| * |T_2| * ... * |T_n|
Assuming each table has M rows, the resulting cost of computing the cross join is then
M_1 * M_2 * ... * M_n = O(M^n)
which is exponential in the number of tables involved in the join.
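To make the growth concrete, here is a small Python sketch (not part of the original answer; the three toy tables are made up for illustration). It materialises a Cartesian product and checks that its row count equals the product of the table sizes.

```python
from itertools import product
from math import prod


def cross_join_size(tables):
    """Number of rows produced by cross joining the given tables."""
    return prod(len(t) for t in tables)


# Hypothetical toy tables; the real ones hold millions of rows.
t1 = [("a",), ("b",), ("c",)]
t2 = [(1,), (2,)]
t3 = [(True,), (False,)]

# Materialise the Cartesian product and compare with the formula.
rows = list(product(t1, t2, t3))
assert len(rows) == cross_join_size([t1, t2, t3])  # 3 * 2 * 2 = 12
```

With n tables of M rows each, `cross_join_size` returns M**n, which is why an unrestricted cross join becomes impractical quickly.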
Try this:
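UPDATE table_1 SET
table_1.name = (
select reference.name
from reference
where table_1.start >= reference.txStart
and table_1.end <= reference.txEnd
-- a scalar subquery must return at most one row; without ORDER BY the row picked is arbitrary
limit 1);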
I see 2 problems with the UPDATE statement.
First, there is no index for the end fields. The compound indexes (annot) you have will be used only for the start fields in this query. You should add them as suggested by Emre:
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;
Second, the JOIN may (and probably does) find many rows of table reference that are related to a row of table_1. So some (or all) of the table_1 rows that are updated are updated many times. Check the result of this query to see if it is the same as your updated rows count (17311434):
SELECT COUNT(*)
FROM table_1
WHERE EXISTS
( SELECT *
FROM reference
WHERE table_1.start >= reference.txStart
AND table_1.`end` <= reference.txEnd
)
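The difference between "matching pairs" and "distinct rows with at least one match" can be seen on a tiny data set. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows and gene names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (start INTEGER, "end" INTEGER, name TEXT);
    CREATE TABLE reference (name TEXT, txStart INTEGER, txEnd INTEGER);
    -- Invented sample data: the third table_1 row matches nothing.
    INSERT INTO table_1 VALUES (10, 20, NULL), (15, 18, NULL), (500, 600, NULL);
    INSERT INTO reference VALUES ('geneA', 0, 100), ('geneB', 5, 50), ('geneC', 12, 19);
""")

# Every (table_1 row, reference row) pair that satisfies the join condition.
join_pairs = conn.execute("""
    SELECT COUNT(*)
    FROM table_1 AS t
    JOIN reference AS r
      ON t.start >= r.txStart AND t."end" <= r.txEnd
""").fetchone()[0]

# table_1 rows that have at least one containing reference interval.
distinct_rows = conn.execute("""
    SELECT COUNT(*)
    FROM table_1 AS t
    WHERE EXISTS (SELECT 1 FROM reference AS r
                  WHERE t.start >= r.txStart AND t."end" <= r.txEnd)
""").fetchone()[0]

print(join_pairs, distinct_rows)
```

Here the join produces more pairs than there are matching table_1 rows, because overlapping reference intervals contain the same table_1 row several times. That extra work is what the EXISTS form avoids counting.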
There can be other ways to write this query, but the lack of a PRIMARY KEY on both tables makes it harder. If you define a primary key on table_1, try this, replacing id with the primary key.
Update: No, do not try it on a table with 34M rows. Check the execution plan and try with smaller tables first.
UPDATE table_1 AS t1
JOIN
( SELECT t2.id
, r.name
FROM table_1 AS t2
JOIN
( SELECT name, txStart, txEnd
FROM reference
GROUP BY txStart, txEnd
) AS r
ON t2.start >= r.txStart
AND t2.`end` <= r.txEnd
GROUP BY t2.id
) AS good
ON good.id = t1.id
SET t1.name = good.name;
You can check the query plan by running EXPLAIN on the equivalent SELECT:
EXPLAIN
SELECT t1.id, t1.name, good.name
FROM table_1 AS t1
JOIN
( SELECT t2.id
, r.name
FROM table_1 AS t2
JOIN
( SELECT name, txStart, txEnd
FROM reference
GROUP BY txStart, txEnd
) AS r
ON t2.start >= r.txStart
AND t2.`end` <= r.txEnd
GROUP BY t2.id
) AS good
ON good.id = t1.id ;
Somebody already suggested adding some indexes. But I think you may get the best performance with these two indexes:
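-- the two column pairs live in different tables, so each index needs its own ALTER TABLE
ALTER TABLE `reference`
ADD INDEX `reference_start_end` (`txStart` ASC, `txEnd` ASC);
ALTER TABLE `table_1`
ADD INDEX `table_1_start_end` (`start` ASC, `end` ASC);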
MySQL will use only one of them for this query, but it will decide automatically which one is more useful.