
MySQL INSERT INTO … SELECT … GROUP BY is too slow

I have a table with about 50M rows, defined as follows:

CREATE TABLE `big_table` (
  `id` BIGINT NOT NULL,
  `t1` DATETIME NOT NULL,
  `a` BIGINT NOT NULL,
  `type` VARCHAR(10) NOT NULL,
  `b` BIGINT NOT NULL,
  `is_c` BOOLEAN NOT NULL,
  PRIMARY KEY (`id`),
  INDEX `a_b_index` (a,b)
) ENGINE=InnoDB;

I then define the table t2, with no indexes:

CREATE TABLE `t2` (
  `id` BIGINT NOT NULL,
  `a` BIGINT NOT NULL,
  `b` BIGINT NOT NULL,
  `t1min` DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

I then populate t2 using a query from big_table (this will add about 12M rows).

insert into t2
  (id, a,b,t1min)
  SELECT id,a,b,min(t1)
    FROM big_table use index (a_b_index)
    where type='SUBMIT' and is_c=1
   GROUP BY a,b;

This query takes about a minute to process 5000 distinct (a,b) pairs in big_table.
Since big_table has 12M distinct (a,b) pairs, running the query over the whole table would take about 40 hours (12M / 5000 = 2400 minutes).

What is going wrong?

If I just run SELECT ..., the query handles 5000 rows in about 2s. If I run SELECT ... INTO OUTFILE ..., it still takes 60s for 5000 rows.

EXPLAIN SELECT ... gives:

id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,big_table,index,NULL,a_b_index,16,NULL,46214255,"Using where"

I found that the problem was that the GROUP BY, which walks a_b_index, caused too many random-access reads of big_table. The following strategy needs only one sequential pass through big_table. First, we add a primary key on (a,b) to t2:

CREATE TABLE `t2` (
  `id` BIGINT NOT NULL,
  `a` BIGINT NOT NULL,
  `b` BIGINT NOT NULL,
  `t1min` DATETIME NOT NULL,
  PRIMARY KEY (a,b),
  INDEX `id` (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Then we fill t2 using:

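insert into t2
  (id, a,b,t1min)
  SELECT id,a,b,t1
    FROM big_table
    where type='SUBMIT' and is_c=1
 ON DUPLICATE KEY UPDATE
   -- MySQL generally evaluates these assignments left to right, so id must be
   -- set before t1min is overwritten; otherwise the id comparison sees the new t1min.
   id=if(big_table.t1<t2.t1min,big_table.id,t2.id),
   t1min=if(big_table.t1<t2.t1min,big_table.t1,t2.t1min);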

The resulting speed-up is several orders of magnitude.
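
One way to sanity-check the result is to compare a small slice of t2 against the plain GROUP BY answer. This is only a sketch: the value a = 42 is purely illustrative and is assumed to exist in big_table. Restricting on a lets the subquery use a_b_index, so the check should stay cheap.

SELECT t2.a, t2.b, t2.t1min, g.t1min_expected
  FROM t2
  JOIN (
    -- Reference answer for one value of a, computed the slow way.
    SELECT a, b, MIN(t1) AS t1min_expected
      FROM big_table
      WHERE a = 42 AND type = 'SUBMIT' AND is_c = 1
      GROUP BY a, b
  ) AS g ON g.a = t2.a AND g.b = t2.b
  -- Keep only the rows where the two approaches disagree.
  WHERE t2.t1min <> g.t1min_expected;

An empty result means the two approaches agree for that slice.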

The GROUP BY might be part of the issue. The query uses the index on (a,b), but that index does nothing for the WHERE clause, so each row still has to be checked against type and is_c. I would add an index on

(type, is_c, a, b )

Also, the query selects id without specifying which row's id to return for each (a,b) group. You probably want MIN(id) for a consistent result.
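
A minimal sketch of both suggestions, assuming the index name type_c_a_b_index (hypothetical, not from the question) and that the smallest id in each group is an acceptable choice:

-- Index that matches the WHERE clause first, then the GROUP BY columns.
ALTER TABLE big_table
  ADD INDEX type_c_a_b_index (type, is_c, a, b);

-- Same insert as in the question, with a deterministic id per group.
INSERT INTO t2 (id, a, b, t1min)
  SELECT MIN(id), a, b, MIN(t1)
    FROM big_table
    WHERE type = 'SUBMIT' AND is_c = 1
    GROUP BY a, b;

Note that MIN(id) and MIN(t1) are computed independently, so they may come from different rows of the same group. Building the index on a 50M-row table also takes time, so it is worth trying on a copy first.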


 