简体   繁体   中英

Super slow query with CROSS JOIN

I have two tables named table_1 (1GB) and reference (250Mb).

When I query a cross join on reference it takes 16hours to update table_1.. We changed the system files EXT3 for XFS but still it's taking 16hrs.. WHAT AM I DOING WRONG??

Here is the update/cross join query:

  mysql> UPDATE table_1 CROSS JOIN reference ON
  -> (table_1.start >= reference.txStart AND table_1.end <= reference.txEnd)
  -> SET table_1.name = reference.name;
  Query OK, 17311434 rows affected (16 hours 36 min 48.62 sec)
  Rows matched: 17311434  Changed: 17311434  Warnings: 0

Here is a show create table of table_1 and reference:

    CREATE TABLE `table_1` (
     `strand` char(1) DEFAULT NULL,
     `chr` varchar(10) DEFAULT NULL,
     `start` int(11) DEFAULT NULL,
     `end` int(11) DEFAULT NULL,
     `name` varchar(255) DEFAULT NULL,
     `name2` varchar(255) DEFAULT NULL,
     KEY `annot` (`start`,`end`)
   ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;


   CREATE TABLE `reference` (
     `bin` smallint(5) unsigned NOT NULL,
     `name` varchar(255) NOT NULL,
     `chrom` varchar(255) NOT NULL,
     `strand` char(1) NOT NULL,
     `txStart` int(10) unsigned NOT NULL,
     `txEnd` int(10) unsigned NOT NULL,
     `cdsStart` int(10) unsigned NOT NULL,
     `cdsEnd` int(10) unsigned NOT NULL,
     `exonCount` int(10) unsigned NOT NULL,
     `exonStarts` longblob NOT NULL,
     `exonEnds` longblob NOT NULL,
     `score` int(11) DEFAULT NULL,
     `name2` varchar(255) NOT NULL,
     `cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
     `cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
     `exonFrames` longblob NOT NULL,
      KEY `chrom` (`chrom`,`bin`),
      KEY `name` (`name`),
      KEY `name2` (`name2`),
      KEY `annot` (`txStart`,`txEnd`)
   ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;

You should index table_1.start , reference.txStart , table_1.end and reference.txEnd table fields:

ALTER TABLE `table_1` ADD INDEX ( `start` ) ;
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txStart` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;

Cross joins are Cartesian Products, which are probably one of the most computationally expensive things to compute (they don't scale well).

For each table T_i for i = 1 to n, the number of rows generated by crossing tables T_1 to T_n is the size of each table multiplied by the size of each other table, ie

|T_1| * |T_2| *... * |T_n|

Assuming each table has M rows, the resulting cost of computing the cross join is then

M_1 * M_2 ... M_n = O(M^n)

which is exponential in the number of tables involved in the join.

Try this:

UPDATE table_1 SET
table_1.name = (
  select reference.name
  from reference
  where table_1.start >= reference.txStart
  and table_1.end <= reference.txEnd)

I see 2 problems with the UPDATE statement.

There is no index for the End fields. The compound indexes ( annot ) you have will be used only for the start fields in this query. You should add them as suggested by Emre:

ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;

Second, the JOIN may (and probably does) find many rows of table reference that are related to a row of table_1 . So some (or all) rows of table_1 that are updated, are updated many times. Check the result of this query, to see if it is the same as your updated rows count ( 17311434 ):

SELECT COUNT(*)
FROM table_1
  WHERE EXISTS
    ( SELECT *
      FROM reference
      WHERE table_1.start >= reference.txStart
        AND table_1.`end` <= reference.txEnd
    )

There can be other ways to write this query but the lack of a PRIMARY KEY on both tables makes it harder. If you define a primary key on table_1 , try this, replacing id with the primary key.

Update : No, do not try it on a table with 34M rows. Check the execution plan and try with smaller tables first.

UPDATE table_1 AS t1
  JOIN 
    ( SELECT t2.id
           , r.name
      FROM table_1 AS t2
        JOIN
          ( SELECT name, txStart, txEnd
            FROM reference
            GROUP BY txStart, txEnd
          ) AS r
          ON  t2.start >= r.txStart
          AND t2.`end` <= r.txEnd
      GROUP BY t2.id
    ) AS good
    ON good.id = t1.id
SET t1.name = good.name;

You can check the query plan by running EXPLAIN on the equivalent SELECT:

EXPLAIN
SELECT t1.id, t1.name, good.name
FROM table_1 AS t1
  JOIN 
    ( SELECT t2.id
           , r.name
      FROM table_1 AS t2
        JOIN
          ( SELECT name, txStart, txEnd
            FROM reference
            GROUP BY txStart, txEnd
          ) AS r
          ON  t2.start >= r.txStart
          AND t2.`end` <= r.txEnd
      GROUP BY t2.id
    ) AS good
    ON good.id = t1.id ;

Somebody already offered you to add some indexes. But I think the best performance you may get with these two indexes:

ALTER TABLE `test`.`time` 
    ADD INDEX `reference_start_end` (`txStart` ASC, `txEnd` ASC),
    ADD INDEX `table_1_star_end` (`start` ASC, `end` ASC);

Only one of them will be used by MySQL query, but MySQL will decide which is more useful automatically.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM