MySQL LEFT JOIN or WHERE IN SUBQUERY

I need some advice. I'm building an app that has to run queries against fairly large tables, possibly at a very high rate, so I'm looking for the best approach performance-wise.

I have the following 2 tables:

Albums:

+---------------+--------------+------+-----+---------+----------------+
|     Field     |     Type     | Null | Key | Default |     Extra      |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| eventid       | int(11)      | NO   | MUL | NULL    |                |
| album         | varchar(200) | NO   |     | NULL    |                |
| filename      | varchar(200) | NO   |     | NULL    |                |
| obstacle_time | time         | NO   |     | NULL    |                |
+---------------+--------------+------+-----+---------+----------------+

and keywords:

+-------------+--------------+------+-----+---------+----------------+
|    Field    |     Type     | Null | Key | Default |     Extra      |
+-------------+--------------+------+-----+---------+----------------+
| id          | int(11)      | NO   | PRI | NULL    | auto_increment |
| eventid     | int(11)      | NO   | MUL | NULL    |                |
| filename    | varchar(200) | NO   |     | NULL    |                |
| bibnumbers  | varchar(200) | NO   |     | NULL    |                |
| gender      | varchar(20)  | YES  |     | NULL    |                |
| top_style   | varchar(20)  | YES  |     | NULL    |                |
| pants_style | varchar(20)  | YES  |     | NULL    |                |
| other       | varchar(20)  | YES  |     | NULL    |                |
| cap         | varchar(200) | NO   |     | NULL    |                |
| tshirt      | varchar(200) | NO   |     | NULL    |                |
| pants       | varchar(200) | NO   |     | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+

Both tables have a unique index (unique_index) declared as a constraint on the eventid + filename columns.

Both tables contain information about the same images, but the albums table is available instantly (as soon as I have the images), while the keywords table usually becomes available several days later, once manual tagging of the images is complete.

Once tagging is enabled, people will search for all kinds of things, but since the results can be huge (10,000 rows or more), I only show them in small chunks so the browser doesn't choke trying to load a huge number of images. As a result, my server will be hit with lots of query requests: every time the visitor scrolls to the bottom of the page, an AJAX request returns the next chunk of images.

Now my question is, which of the following queries is better performance-wise:

SELECT `albums`.`filename`,`basket`.`id`,`albums`.`id`,`obstacle_time`
FROM `albums`
LEFT JOIN `basket`
    ON `basket`.`eventid` = `albums`.`eventid`
        AND `basket`.`fileid` = `albums`.`id`
        AND `basket`.`visitor_id` = 1
LEFT JOIN `keywords`
    ON `keywords`.`eventid` = `albums`.`eventid`
        AND `albums`.`filename` = `keywords`.`filename`
WHERE
    `albums`.`eventid` = 1
    AND `album` LIKE '%string%'
    AND `obstacle_time` >= '08:00:00'
    AND `obstacle_time` <= '14:11:10'
    AND `gender` = 1
    AND `top_style` REGEXP '[[:<:]]0[[:>:]]|[[:<:]]1[[:>:]]'
    AND `cap` = '2'
    AND `tshirt` = '1'
    AND `pants` = '3'
ORDER BY `obstacle_time`
LIMIT X, 10

Or using an IN clause inside WHERE, like:

SELECT `albums`.`filename`,`basket`.`id`,`albums`.`id`,`obstacle_time` 
FROM `albums` 
LEFT JOIN `basket` 
    ON `basket`.`eventid` = `albums`.`eventid` 
        AND `basket`.`fileid` = `albums`.`id` 
        AND `basket`.`visitor_id` = 1 
WHERE 
    `albums`.`eventid` = 1
    AND `album` LIKE '%string%' 
    AND `obstacle_time` >= '08:00:00' 
    AND `obstacle_time` <= '14:11:10' 
    AND `filename` IN (
        SELECT `filename` 
        FROM `keywords`
        WHERE
            `eventid` = 1 
            AND `gender` = 1 
            AND `top_style` REGEXP '[[:<:]]0[[:>:]]|[[:<:]]1[[:>:]]' 
            AND `cap` = '2' 
            AND `tshirt` = '1' 
            AND `pants` = '3'
    )
ORDER BY `obstacle_time`
LIMIT X, 10

I have looked at similar questions but wasn't able to figure out which is the best course of action.

My understanding so far is that:

  • Using LEFT JOIN takes advantage of indexing, BUT if I use it I will get a full join of the tables even when I only need a significantly smaller result set, so it's almost a waste to join thousands of rows just to filter most of them out afterwards.

  • Using IN with a subquery isn't indexed? I'm not 100% sure about this. I'm using MySQL 5.6, and to the best of my understanding, since 5.6 even subqueries get automatically indexed by MySQL. I think this method has benefits when the result is significantly filtered, but I'm not sure there will be any benefit if the subquery returns all the possible filenames.

As footnote questions:

  • Should I consider returning the whole result to the client on the first query and using client-side (HTML) techniques to load the images gradually, rather than re-querying the server each time?

  • Should I consider merging the 2 tables into 1, and how much of a performance impact would that have? (It can be tricky for various reasons, which have no place in this question.)

Thanks.

EDIT 1

EXPLAIN for the JOIN query:

+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
| id | select_type |     table     |  type  | possible_keys |     key      | key_len |                  ref                   | rows |                       Extra                        |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
|  1 | SIMPLE      | albums_2015   | ref    | unique_index  | unique_index | 4       | const                                  | 6475 | Using where; Using temporary; Using filesort       |
|  1 | SIMPLE      | basket        | ALL    | NULL          | NULL         | NULL    | NULL                                   |    2 | Using where; Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | keywords_2015 | eq_ref | unique_index  | unique_index | 206     | const,mybibnumber.albums_2015.filename |    1 | Using index                                        |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+

EXPLAIN for the WHERE IN query:

+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
| id | select_type |     table     |  type  | possible_keys |     key      | key_len |                  ref                   | rows |                       Extra                        |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
|  1 | SIMPLE      | albums_2015   | ref    | unique_index  | unique_index | 4       | const                                  | 6475 | Using where; Using temporary; Using filesort       |
|  1 | SIMPLE      | keywords_2015 | eq_ref | unique_index  | unique_index | 206     | const,mybibnumber.albums_2015.filename |    1 | Using where                                        |
|  1 | SIMPLE      | basket        | ALL    | NULL          | NULL         | NULL    | NULL                                   |    2 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
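
The two plans are almost identical. Both report select_type SIMPLE, which suggests the optimizer has flattened the IN subquery into a join, and keywords_2015 is reached through unique_index either way. Because (eventid, filename) is unique in the keywords table, the IN subquery cannot duplicate album rows, and a LEFT JOIN whose right-hand columns are filtered in WHERE discards the unmatched rows anyway. A sketch of the plain INNER JOIN that both variants appear to reduce to, using the table names shown in the plans (this is a reading of the EXPLAIN output, not something verified against the real data):

```sql
-- Sketch only: the join form both queries seem to collapse into.
-- Table and column names come from the question; LIMIT 0, 10 is the first chunk.
EXPLAIN
SELECT a.`filename`, a.`id`, a.`obstacle_time`
FROM `albums_2015` AS a
INNER JOIN `keywords_2015` AS k
    ON k.`eventid` = a.`eventid`
   AND k.`filename` = a.`filename`
WHERE a.`eventid` = 1
  AND a.`obstacle_time` BETWEEN '08:00:00' AND '14:11:10'
  AND k.`bibnumbers` REGEXP '[[:<:]]113[[:>:]]|[[:<:]]106[[:>:]]'
ORDER BY a.`obstacle_time`
LIMIT 0, 10;
```

If this EXPLAIN matches the two plans above, the choice between LEFT JOIN and WHERE IN is probably not where the time goes. The "Using temporary; Using filesort" on albums_2015 would be the more likely cost.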

EDIT 2

I wasn't able to set up a SQL Fiddle (I kept getting a "something went wrong" error), so I have created a test database on one of my servers.

Address: http://188.165.217.185/phpmyadmin/, user: temp_test, pass: test_temp

I'm still building the whole thing and I don't have all the values filled in yet (top_style, pants_style, etc.), so more appropriate queries for the test scenario are:

WHERE IN:

SELECT `albums_2015`.`filename`, 
       `albums_2015`.`id`, 
       `obstacle_time` 
FROM   `albums_2015` 
WHERE  `albums_2015`.`eventid` = 1 
       AND `album` LIKE '%' 
       AND `obstacle_time` >= '08:00:00' 
       AND `obstacle_time` <= '14:11:10' 
       AND `filename` IN (SELECT `filename`
                          FROM   `keywords_2015`
                          WHERE  `eventid` = 1
                                 AND `bibnumbers` REGEXP '[[:<:]]113[[:>:]]|[[:<:]]106[[:>:]]')
ORDER  BY `obstacle_time` 
LIMIT  0, 10 

LEFT JOIN:

SELECT `albums_2015`.`filename`,`albums_2015`.`id`,`obstacle_time`
    FROM `albums_2015`
        LEFT JOIN `keywords_2015`
        ON `keywords_2015`.`eventid` = `albums_2015`.`eventid`
            AND `albums_2015`.`filename` = `keywords_2015`.`filename`
    WHERE
        `albums_2015`.`eventid` = 1
        AND `album` LIKE '%'
        AND `obstacle_time` >= '08:00:00'
        AND `obstacle_time` <= '14:11:10'
        AND `bibnumbers` REGEXP '[[:<:]]113[[:>:]]|[[:<:]]106[[:>:]]'
    ORDER BY `obstacle_time`
    LIMIT 0, 10

A few more tips:

  • Joins that use an index are the best option when you have to deal with multi-table queries.

Don't hesitate to add indexes to speed up your queries (an index takes space, but on an INT field that's next to nothing, and you gain far more than you lose).
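
As a rough sketch of what that could look like for the tables in the question: the index names below are made up, and the column choices are an assumption based on the posted queries (eventid matched by equality, obstacle_time used for the range filter and the ORDER BY).

```sql
-- Hypothetical index names; columns are taken from the queries in the question.
-- eventid is an equality match and obstacle_time is both the range filter and
-- the sort key, so a composite index may let MySQL avoid the filesort.
ALTER TABLE `albums_2015`
    ADD INDEX `idx_event_time` (`eventid`, `obstacle_time`);

-- The posted EXPLAIN shows type = ALL for basket (no usable index).
-- With only 2 rows a scan is harmless, but it will not stay that way.
ALTER TABLE `basket`
    ADD INDEX `idx_visitor_event_file` (`visitor_id`, `eventid`, `fileid`);
```

Re-running EXPLAIN afterwards shows whether the optimizer actually picks these up. The `LIKE '%string%'` filter on album cannot use an ordinary index because of the leading wildcard.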


  • With big tables, caching the data in the distant table is usually a good idea.

An insert trigger on TAG_table that caches the displayed part in the distant table (like the tag name for the albums overview) can help you keep your join queries at a decent frequency.
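
A minimal sketch of such a trigger, adapted to the tables from the question rather than a generic TAG_table. The cache column `bibnumbers_cache` and the trigger name are hypothetical, and copying bibnumbers is only one example of what could be cached.

```sql
-- Assumption: a hypothetical cache column added to the albums table.
ALTER TABLE `albums_2015`
    ADD COLUMN `bibnumbers_cache` VARCHAR(200) NULL;

-- Copy the tagged value across whenever a keywords row is inserted.
-- A single-statement trigger body needs no DELIMITER change.
CREATE TRIGGER `keywords_2015_ai`
AFTER INSERT ON `keywords_2015`
FOR EACH ROW
    UPDATE `albums_2015`
    SET `bibnumbers_cache` = NEW.`bibnumbers`
    WHERE `eventid` = NEW.`eventid`
      AND `filename` = NEW.`filename`;
```

If tags can be edited after the first insert, a matching AFTER UPDATE trigger would be needed as well, otherwise the cached value goes stale.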


  • Be careful with REGEXP, it hurts performance badly. Adding a new table to split the data is a better idea (and it lets you use indexing, which is a native optimisation).

  • Every field in the WHERE clause of a big, frequent query should have an index on it. If you can't put one on it, then your DB model is f**ked up and needs to be changed.
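
To illustrate the REGEXP tip, here is one possible way to split bibnumbers out into its own table. The table `keyword_bibnumbers` and its columns are hypothetical, and the sketch assumes bib numbers are plain integers and that each row can point at `albums_2015`.`id`.

```sql
-- Hypothetical table: one row per (image, bib number) instead of a
-- delimited list stored in keywords_2015.bibnumbers.
CREATE TABLE `keyword_bibnumbers` (
    `eventid`   INT NOT NULL,
    `bibnumber` INT NOT NULL,
    `album_id`  INT NOT NULL,  -- assumed to reference albums_2015.id
    PRIMARY KEY (`eventid`, `bibnumber`, `album_id`)
);

-- The bib-number search becomes an indexed lookup instead of a REGEXP scan.
SELECT a.`filename`, a.`id`, a.`obstacle_time`
FROM `albums_2015` AS a
WHERE a.`eventid` = 1
  AND a.`obstacle_time` BETWEEN '08:00:00' AND '14:11:10'
  AND EXISTS (
      SELECT 1
      FROM `keyword_bibnumbers` AS b
      WHERE b.`eventid` = a.`eventid`
        AND b.`album_id` = a.`id`
        AND b.`bibnumber` IN (113, 106)
  )
ORDER BY a.`obstacle_time`
LIMIT 0, 10;
```

EXISTS is used rather than a plain join so that an image tagged with both bib numbers is returned only once.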
