Super slow query with CROSS JOIN

Question

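I have two tables named table_1 (1 GB) and reference (250 MB).

When I update table_1 using a cross join against reference, the update takes 16 hours. We changed the file system from EXT3 to XFS, but it still takes 16 hours. What am I doing wrong?

Here is the update/cross join query: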

  mysql> UPDATE table_1 CROSS JOIN reference ON
  -> (table_1.start >= reference.txStart AND table_1.end <= reference.txEnd)
  -> SET table_1.name = reference.name;
  Query OK, 17311434 rows affected (16 hours 36 min 48.62 sec)
  Rows matched: 17311434  Changed: 17311434  Warnings: 0

Here is the SHOW CREATE TABLE output for table_1 and reference:

    CREATE TABLE `table_1` (
     `strand` char(1) DEFAULT NULL,
     `chr` varchar(10) DEFAULT NULL,
     `start` int(11) DEFAULT NULL,
     `end` int(11) DEFAULT NULL,
     `name` varchar(255) DEFAULT NULL,
     `name2` varchar(255) DEFAULT NULL,
     KEY `annot` (`start`,`end`)
   ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;


   CREATE TABLE `reference` (
     `bin` smallint(5) unsigned NOT NULL,
     `name` varchar(255) NOT NULL,
     `chrom` varchar(255) NOT NULL,
     `strand` char(1) NOT NULL,
     `txStart` int(10) unsigned NOT NULL,
     `txEnd` int(10) unsigned NOT NULL,
     `cdsStart` int(10) unsigned NOT NULL,
     `cdsEnd` int(10) unsigned NOT NULL,
     `exonCount` int(10) unsigned NOT NULL,
     `exonStarts` longblob NOT NULL,
     `exonEnds` longblob NOT NULL,
     `score` int(11) DEFAULT NULL,
     `name2` varchar(255) NOT NULL,
     `cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
     `cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
     `exonFrames` longblob NOT NULL,
      KEY `chrom` (`chrom`,`bin`),
      KEY `name` (`name`),
      KEY `name2` (`name2`),
      KEY `annot` (`txStart`,`txEnd`)
   ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;

Answer 1

You should index the table_1.start, reference.txStart, table_1.end and reference.txEnd fields:

ALTER TABLE `table_1` ADD INDEX ( `start` ) ;
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txStart` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;

Answer 2

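A cross join is a Cartesian product, which is probably one of the most computationally expensive operations there is (it does not scale well).

For tables T_i with i = 1 to n, the number of rows generated by crossing T_1 through T_n is the product of the sizes of all the tables, i.e.

|T_1| * |T_2| *... * |T_n|

Assuming each table has M rows, the final cost of computing the cross join is

M * M * ... * M (n factors) = M^n = O(M^n)

which is exponential in the number of tables involved in the join.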
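To make the row-count formula concrete, here is a minimal sketch that builds a few tiny tables and counts the rows of their cross join. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for MySQL; the helper name cross_join_rows and the table names t0, t1, ... are hypothetical and not part of the question.

```python
import sqlite3


def cross_join_rows(sizes):
    """Create one single-column table per entry in `sizes` and
    return the number of rows produced by cross joining them all."""
    conn = sqlite3.connect(":memory:")
    names = []
    for i, size in enumerate(sizes):
        name = f"t{i}"  # hypothetical table name, generated locally
        conn.execute(f"CREATE TABLE {name} (v INTEGER)")
        conn.executemany(
            f"INSERT INTO {name} VALUES (?)", [(k,) for k in range(size)]
        )
        names.append(name)
    query = "SELECT COUNT(*) FROM " + " CROSS JOIN ".join(names)
    (count,) = conn.execute(query).fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    print(cross_join_rows([10, 10]))      # 100
    print(cross_join_rows([10, 10, 10]))  # 1000
```

Adding one more 10-row table multiplies the result by 10, which is the M^n growth described above.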

Answer 3

Try this:

UPDATE table_1 SET
table_1.name = (
  select reference.name
  from reference
  where table_1.start >= reference.txStart
  and table_1.end <= reference.txEnd)
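One behaviour of this form is worth checking on a small sample first: a scalar subquery that finds no matching row yields NULL, so rows of table_1 with no enclosing reference interval get their name overwritten with NULL, whereas the JOIN-based update leaves them untouched. The sketch below illustrates this with Python's sqlite3 module as a stand-in for MySQL; the sample rows and the helper name update_with_subquery are made up for illustration. It deliberately uses at most one matching reference row per table_1 row, because the two engines are not assumed to behave the same way when the subquery returns several rows.

```python
import sqlite3


def update_with_subquery(rows, references):
    """Apply the correlated-subquery UPDATE to a tiny copy of the schema
    and return the resulting (start, end, name) rows ordered by start."""
    conn = sqlite3.connect(":memory:")
    # "end" is quoted because END is a keyword in SQLite.
    conn.execute('CREATE TABLE table_1 (start INTEGER, "end" INTEGER, name TEXT)')
    conn.execute("CREATE TABLE reference (name TEXT, txStart INTEGER, txEnd INTEGER)")
    conn.executemany("INSERT INTO table_1 VALUES (?, ?, ?)", rows)
    conn.executemany("INSERT INTO reference VALUES (?, ?, ?)", references)
    conn.execute(
        "UPDATE table_1 SET name = ("
        " SELECT reference.name FROM reference"
        " WHERE table_1.start >= reference.txStart"
        '   AND table_1."end" <= reference.txEnd)'
    )
    result = conn.execute(
        'SELECT start, "end", name FROM table_1 ORDER BY start'
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample_rows = [(10, 20, "old1"), (500, 510, "old2")]  # made-up data
    sample_refs = [("geneA", 0, 100)]                     # made-up data
    print(update_with_subquery(sample_rows, sample_refs))
    # [(10, 20, 'geneA'), (500, 510, None)]
```

The second row lies outside every reference interval, so its previous name is replaced by NULL.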

Answer 4

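I see 2 problems with the UPDATE statement.

First, there is no index on the end fields. The composite indexes you have (annot) will only be used for the start fields in this query. You should add them as Emre suggested: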

ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;

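Second, the JOIN may (and probably does) find many rows of table reference that match a single row of table_1. So some (or all) of the table_1 rows being updated are updated many times. Check the result of this query and see whether it is the same as the number of rows you updated (17311434):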

SELECT COUNT(*)
FROM table_1
  WHERE EXISTS
    ( SELECT *
      FROM reference
      WHERE table_1.start >= reference.txStart
        AND table_1.`end` <= reference.txEnd
    )
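The difference between "rows produced by the join" and "rows of table_1 that have at least one match" can be seen on a toy data set. The following is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the sample intervals, gene names and the helper name match_counts are invented for illustration and do not come from the question's data.

```python
import sqlite3


def match_counts(intervals, references):
    """Return (join_rows, matched_rows): the number of rows the range join
    produces, and the number of table_1 rows with at least one match."""
    conn = sqlite3.connect(":memory:")
    # "end" is quoted because END is a keyword in SQLite.
    conn.execute('CREATE TABLE table_1 (start INTEGER, "end" INTEGER)')
    conn.execute("CREATE TABLE reference (name TEXT, txStart INTEGER, txEnd INTEGER)")
    conn.executemany("INSERT INTO table_1 VALUES (?, ?)", intervals)
    conn.executemany("INSERT INTO reference VALUES (?, ?, ?)", references)
    (join_rows,) = conn.execute(
        "SELECT COUNT(*) FROM table_1 JOIN reference"
        " ON table_1.start >= reference.txStart"
        ' AND table_1."end" <= reference.txEnd'
    ).fetchone()
    (matched_rows,) = conn.execute(
        "SELECT COUNT(*) FROM table_1 WHERE EXISTS ("
        " SELECT 1 FROM reference"
        " WHERE table_1.start >= reference.txStart"
        '   AND table_1."end" <= reference.txEnd)'
    ).fetchone()
    conn.close()
    return join_rows, matched_rows


if __name__ == "__main__":
    sample_intervals = [(10, 20), (150, 160), (500, 510)]  # made-up data
    sample_refs = [
        ("geneA", 0, 100),    # contains (10, 20)
        ("geneB", 5, 50),     # also contains (10, 20)
        ("geneC", 100, 200),  # contains (150, 160)
    ]
    print(match_counts(sample_intervals, sample_refs))  # (3, 2)
```

Here the join yields 3 rows but only 2 distinct table_1 rows are matched, so one row would be written twice. If the EXISTS count on the real data is well below 17311434, the same thing is happening at scale.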

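There may be other ways to write this query, but the lack of a PRIMARY KEY on both tables makes it harder. If you define a primary key on table_1, try the statement below, replacing id with the primary key.

Update: No, don't try this on a table with 34M rows. Check the execution plan and try it with smaller tables first.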

UPDATE table_1 AS t1
  JOIN 
    ( SELECT t2.id
           , r.name
      FROM table_1 AS t2
        JOIN
          ( SELECT name, txStart, txEnd
            FROM reference
            GROUP BY txStart, txEnd
          ) AS r
          ON  t2.start >= r.txStart
          AND t2.`end` <= r.txEnd
      GROUP BY t2.id
    ) AS good
    ON good.id = t1.id
SET t1.name = good.name;

You can check the query plan by running EXPLAIN on the equivalent SELECT:

EXPLAIN
SELECT t1.id, t1.name, good.name
FROM table_1 AS t1
  JOIN 
    ( SELECT t2.id
           , r.name
      FROM table_1 AS t2
        JOIN
          ( SELECT name, txStart, txEnd
            FROM reference
            GROUP BY txStart, txEnd
          ) AS r
          ON  t2.start >= r.txStart
          AND t2.`end` <= r.txEnd
      GROUP BY t2.id
    ) AS good
    ON good.id = t1.id ;

Answer 5

Someone has already suggested that you add some indexes. But I think the best performance you can get is with these two indexes:

ALTER TABLE `reference`
    ADD INDEX `reference_start_end` (`txStart` ASC, `txEnd` ASC);
ALTER TABLE `table_1`
    ADD INDEX `table_1_start_end` (`start` ASC, `end` ASC);

The MySQL query will only use one of them, but MySQL will decide automatically which one is more useful.

5 answers

Answer 1: 4, 2011-07-04 02:42:53
Answer 2: 1
Answer 3: 0, 2011-07-04 02:45:31
Answer 4: 0, accepted, 2011-07-04 09:13:19
Answer 5: 0, 2011-07-06 07:41:00