SQL beginner - Why does my MySQL Join-query use no index / work so slow?

I have the following problem with my MySQL 5.5 database. I am pretty new to this, so it might be very obvious what's wrong, but I just can't seem to figure it out:

Two tables:


Table1

CREATE TABLE `sequence_matches` (
    `Sample_ID` INT(6) NOT NULL,
    `Sequence_Match_ID` INT(8) NOT NULL,
    `Start` INT(6) NULL DEFAULT NULL,
    `End` INT(6) NULL DEFAULT NULL,
    `Coverage` DOUBLE(5,2) NULL DEFAULT NULL,
    `Frag_String` VARCHAR(255) NULL DEFAULT NULL,
    `rms_mass_error_prod` DOUBLE(10,4) NULL DEFAULT NULL,
    `rms_rt_error_prod` DOUBLE(10,4) NULL DEFAULT NULL,
  PRIMARY KEY (`Sample_ID`, `Sequence_Match_ID`)
)

and


Table 2

CREATE TABLE `peptide_identifications` (
   `Sample_ID` INT(6) NOT NULL,
   `Peptide_identification_ID` INT(8) NOT NULL,
   `Mass_error` DOUBLE(10,4) NULL DEFAULT NULL,
   `Mass_error_ppm` DOUBLE(10,4) NULL DEFAULT NULL,
   `Score` DOUBLE(10,4) NULL DEFAULT NULL,
   `Type` VARCHAR(45) NULL DEFAULT NULL,
   `global_pept_ID` INT(8) NOT NULL,
  PRIMARY KEY (`Sample_ID`, `Peptide_identification_ID`),
  INDEX `Index` (`global_pept_ID`)
)

Each of them contains ~15 million rows.

Now I want all rows from Table 2 where global_pept_id = 27443, and then I want to use the peptide_identification_id of those rows to fetch all information from Table 1 where sequence_match_id = peptide_identification_id.

I tried the following statement:

SELECT * from sequence_matches 
JOIN (
  SELECT peptide_identification_id 
  FROM peptide_identifications 
  WHERE global_pept_id = 27443
) as tmp_pept 
ON sequence_match_id = peptide_identification_id; 

Here is the EXPLAIN output for that query:

http://i.stack.imgur.com/QV3ER.jpg

Now this query is very, very slow (I never actually let it finish; I stopped it after ~10 min). I imagine that is because no index is used for the second table, although both IDs are primary keys and thus should be indexed, right?

Run on its own, the inner SELECT takes ~3 s and returns ~3k rows. So I think the problem is that 3,000 * 15 million comparisons are being made, because every row in Table2 is checked.

But how do I fix this?

Any help appreciated. -voiD

It's probably because you're joining on a subquery. Try:

SELECT sm.*, pi.peptide_identification_id
FROM sequence_matches sm
INNER JOIN peptide_identifications pi
ON sm.sequence_match_id = pi.peptide_identification_id
WHERE pi.global_pept_id = 27443

Slightly different than other solutions. Consider the primary criteria you are trying to get first... those peptide elements for a given global peptide value. Ensure you have an index on this table on any such criteria you may be querying against (which you have). However, if you find you will be querying on more than one WHERE condition against the same table, try to prepare/have an index that will help on BOTH criteria.

Then, put a JOIN condition to the other table on the PK/FK relationship to get those records.

SELECT *
   FROM
      peptide_identifications PI
         JOIN sequence_matches SM
            ON PI.peptide_identification_id = SM.sequence_match_id
   WHERE
      PI.global_pept_id = 27443

Missing indexes can severely hurt a query's performance. Your sequence_matches table should have an index on just (Sequence_Match_ID) to help the optimizer. Because Sequence_Match_ID sits in the second position of the primary key (after Sample_ID), the primary key cannot be used efficiently for a lookup on Sequence_Match_ID alone.
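
As a minimal sketch of that suggestion, assuming the tables are defined exactly as in the question, the single-column index could be added like this. The index name idx_sequence_match_id is only an illustrative choice, and building the index on ~15 million rows may take a while:

```sql
-- Add a secondary index whose leading column is Sequence_Match_ID,
-- so lookups on that column alone do not depend on the composite primary key.
-- The index name is hypothetical; use whatever fits your naming conventions.
ALTER TABLE sequence_matches
  ADD INDEX idx_sequence_match_id (Sequence_Match_ID);
```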

A tip would be to avoid subselects. Sometimes they are great, but they often result in poor performance. A better way might be:

SELECT * FROM peptide_identifications AS tmp_pept
JOIN sequence_matches  
    ON sequence_matches.sequence_match_id = tmp_pept.peptide_identification_id
WHERE tmp_pept.global_pept_id = 27443

Does that do the trick?

Edit: No, the real problem is that there is no index on sequence_match_id. Add one and you'll probably be fine.
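
One way to check whether such an index is actually picked up is to run EXPLAIN on the join again. This is only a sketch: it assumes an index on Sequence_Match_ID has already been added, and the aliases pep and sm are arbitrary.

```sql
-- List the indexes that currently exist on the table.
SHOW INDEX FROM sequence_matches;

-- Show the execution plan for the join without running the query.
EXPLAIN
SELECT sm.*
FROM peptide_identifications AS pep
JOIN sequence_matches AS sm
  ON sm.Sequence_Match_ID = pep.Peptide_identification_ID
WHERE pep.global_pept_ID = 27443;
```

If the index is used, the EXPLAIN row for sequence_matches should typically name it in the key column instead of showing type ALL, which indicates a full table scan.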

I think the problem could be that you are now creating a cross join instead of an inner join. Your subquery is creating a Cartesian product of 15 million rows * 3 million rows.

Using an inner join, you would reduce that number to 15 million * 3000 rows.

That is still a huge number. On the SQL side you can restrict it with LIMIT 10 or LIMIT 20 (MySQL's counterpart to TOP 10 or TOP 20).

On the front end, if it is C#, you should use a paging technique such as a GridView pager on the data source, assuming you are going to display the results at the front end. The pager sits on top of your SQL join query and pages 20 results at a time.
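
A rough sketch of what such paging could look like on the MySQL side, where row limiting is written with LIMIT rather than TOP. The page size of 20, the ORDER BY columns, and the aliases are assumptions for illustration, not something from the question:

```sql
-- Fetch the first page of 20 rows; a stable ORDER BY keeps pages consistent.
SELECT sm.*
FROM peptide_identifications AS pep
JOIN sequence_matches AS sm
  ON sm.Sequence_Match_ID = pep.Peptide_identification_ID
WHERE pep.global_pept_ID = 27443
ORDER BY sm.Sample_ID, sm.Sequence_Match_ID
LIMIT 20 OFFSET 0;  -- for page n (starting at 1), use OFFSET (n - 1) * 20
```

Limiting the output only reduces how many rows are returned; the join itself still needs a usable index to be fast.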
