简体   繁体   中英

How to optimize this inner join query to decrease query time

I have a table that now has about 1m rows. The following query need about 5 seconds to complete. What you suggest optimizing query speed?

# Thread_id: 14  Schema: defrop_defrop  QC_hit: No
# Query_time: 5.573048  Lock_time: 0.591625  Rows_sent: 0  Rows_examined: 1006391
# Rows_affected: 1
UPDATE `backlinks` as a
INNER JOIN(SELECT b.`id` as bid
           FROM `backlinks` b
           WHERE b.`googlebot_id` IS NULL AND b.`used_time` IS NULL AND 
b.`campaign_id` IN  (SELECT `id` FROM `campaigns` WHERE `status`=true) GROUP BY b.`campaign_id` ORDER BY RAND() limit 1
           ) as c
 ON (a.id = c.bid)
SET a.`crawler_id` = '10.0.0.13', a.`used_time`=NOW();

campaign_id, googlebot_id are foreing keys, indexers. used_time and crawler_id are indexers Screenshot from the table phpmyadmin Phpmyadmin表反向链接

This is the query formatted so I can read it better:

UPDATE backlinks bl JOIN
       (SELECT bl2.id as bid
        FROM backlinks bl2
        WHERE bl2.googlebot_id IS NULL AND
              bl2.used_time IS NULL AND 
              bl2.campaign_id IN (SELECT c.id FROM campaigns c WHERE status = true)
       GROUP BY b.campaign_id
       ORDER BY RAND() 
       LIMIT 1
     ) bl2
     ON bl.id = bl2.bid
    SET bl.crawler_id = '10.0.0.13',
        bl.used_time = NOW();

First, the GROUP BY in the subquery is not needed. And I would replace the IN with EXISTS :

UPDATE backlinks bl JOIN
       (SELECT bl2.id as bid
        FROM backlinks bl2
        WHERE bl2.googlebot_id IS NULL AND
              bl2.used_time IS NULL AND 
              EXISTS (SELECT 1 FROM campaigns c WHERE bl2.campaign_id = c.id AND c.status = true)
        ORDER BY RAND() 
        LIMIT 1
      ) bl2
      ON bl.id = bl2.bid
    SET bl.crawler_id = '10.0.0.13',
        bl.used_time = NOW();

That will help a little, but probably not much. My guess is that the performance issue is the size of the outer sort (or equivalently, the size of the data needed for the GROUP BY in your query).

You can also get rid of the subquery entirely:

UPDATE backlinks bl
    SET bl.crawler_id = '10.0.0.13',
        bl.used_time = NOW()
WHERE bl.googlebot_id IS NULL AND
      bl.used_time IS NULL AND 
      EXISTS (SELECT 1 FROM campaigns c WHERE bl.campaign_id = c.id AND c.status = true)
ORDER BY RAND()
LIMIT 1;

This has minimal impact, but it cleans up the logic a bit.

My guess is that the WHERE conditions are not very selective, so optimizing them won't help much.

At this point, the issue is the ORDER BY RAND() . If you have any idea how many rows are being returned by the subquery, then you can pre-filter using RAND() . For instance, let me assume that there are at least 1,000 rows being returned. Then:

UPDATE backlinks bl
    SET bl.crawler_id = '10.0.0.13',
        bl.used_time = NOW()
WHERE bl.googlebot_id IS NULL AND
      bl.used_time IS NULL AND 
      EXISTS (SELECT 1 FROM campaigns c WHERE bl.campaign_id = c.id AND c.status = true) AND
      RAND() < 0.01  -- keep about 1/100
ORDER BY RAND()
LIMIT 1;

This significantly speeds the sort, because it is on 100th the data. However, it could filter out all rows -- if not enough rows match the conditions.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM