
Optimize query with 1 join, on tables with 10+ million rows

I am looking at making a query that uses 2 tables faster.
I have the following 2 tables:

Table "logs"

  • id varchar(36) PK
  • date timestamp(2)
  • more varchar fields, and one text field

That table has what the PHP Laravel Framework calls a "polymorphic many to many" relationship with several other objects, so there is a second table "logs_pivot":

  • id unsigned int PK
  • log_id varchar(36) FOREIGN KEY (logs.id)
  • model_id varchar(40)
  • model_type varchar(50)
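For reference, the two tables can be sketched as DDL like this (a non-authoritative reconstruction from the description above; the extra columns of logs are placeholders):

```sql
-- Sketch only: columns beyond those listed above are placeholders.
CREATE TABLE logs (
  id      varchar(36)  NOT NULL PRIMARY KEY,  -- UUID stored as text
  date    timestamp(2) NOT NULL,
  -- more varchar fields, and one text field
  message text
) ENGINE=InnoDB;

CREATE TABLE logs_pivot (
  id         int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  log_id     varchar(36)  NOT NULL,
  model_id   varchar(40)  NOT NULL,
  model_type varchar(50)  NOT NULL,
  CONSTRAINT logs_pivot_log_id_foreign
    FOREIGN KEY (log_id) REFERENCES logs (id)
) ENGINE=InnoDB;
```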

There are one or several entries in logs_pivot per entry in logs. They have 20+ and 10+ million rows, respectively.

We do queries like so:

select * from logs 
join logs_pivot on logs.id = logs_pivot.log_id
where model_id = 'some_id' and model_type = 'My\Class'
order by date desc
limit 50;

Obviously we have a compound index on both the model_id and model_type fields, but the requests are still slow: several (dozens of) seconds every time.
We also have an index on the date field, but an EXPLAIN shows that it is the model_id_model_type index that is used.
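For clarity, the indexes described here would have been created roughly like this (index names taken from the EXPLAIN output below; the exact original DDL is not shown in the question):

```sql
CREATE INDEX logs_pivot_model_id_model_type_index ON logs_pivot (model_id, model_type);
CREATE INDEX logs_pivot_log_id_index              ON logs_pivot (log_id);
CREATE INDEX logs_date_index                      ON logs (date);  -- name assumed
```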

Explain statement:

+----+-------------+------------+------------+--------+--------------------------------------------------------------+--------------------------------------+---------+-------------------------------+------+----------+---------------------------------+
| id | select_type | table      | partitions | type   | possible_keys                                                | key                                  | key_len | ref                           | rows | filtered | Extra                           |
+----+-------------+------------+------------+--------+--------------------------------------------------------------+--------------------------------------+---------+-------------------------------+------+----------+---------------------------------+
|  1 | SIMPLE      | logs_pivot | NULL       | ref    | logs_pivot_model_id_model_type_index,logs_pivot_log_id_index | logs_pivot_model_id_model_type_index | 364     | const,const                   |    1 |   100.00 | Using temporary; Using filesort |
|  1 | SIMPLE      | logs       | NULL       | eq_ref | PRIMARY                                                      | PRIMARY                              | 146     | the_db_name.logs_pivot.log_id |    1 |   100.00 | NULL                            |
+----+-------------+------------+------------+--------+--------------------------------------------------------------+--------------------------------------+---------+-------------------------------+------+----------+---------------------------------+

In other tables, I was able to make a similar request much faster by including the date field in the index. But in this case they are in separate tables.

When we want to access these data, they are typically a few hours/days old.
Our InnoDB buffer pool is much too small to hold all that data (plus all the other tables) in memory, so the data is most probably always read from disk.
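For reference, the buffer-pool size can be compared against the tables' on-disk footprint like this (the schema name is taken from the EXPLAIN output and may differ):

```sql
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Approximate data + index size per table, in MB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
  FROM information_schema.tables
 WHERE table_schema = 'the_db_name'
   AND table_name IN ('logs', 'logs_pivot');
```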

What would be all the ways we could make that request faster?
Ideally only with another index, or by changing how the query is done.

Thanks a lot!


Edit 17h05:
Thank you all for your answers so far. I will try something like O Jones suggests, and also somehow include the date field in the pivot table, so that I can include it in the index.
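One way to "include the date field in the pivot table" would be a denormalization along these lines (a hypothetical migration; the column and index names are made up for illustration):

```sql
-- Copy the log date onto each pivot row (denormalization)
ALTER TABLE logs_pivot ADD COLUMN log_date timestamp(2) NULL;

UPDATE logs_pivot lp
  JOIN logs l ON l.id = lp.log_id
   SET lp.log_date = l.date;

-- Now the filter columns and the sort column live in one index
CREATE INDEX logs_pivot_model_date_index
    ON logs_pivot (model_id, model_type, log_date);
```

The trade-off is that every insert must now carry the date, and the two copies of the date must be kept in sync.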


Edit 14/10 10h.

Solution:

So I ended up changing how the request is actually done, by sorting on the id field of the pivot table, which indeed allows putting it in an index.

Also, the request to count the total number of rows was changed to be done only on the pivot table, when it is not filtered by date.
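Concretely, the reworked queries might look something like this (a sketch based on the description above, not the actual code; since logs_pivot.id is an auto-increment, sorting on it approximates sorting by insertion time):

```sql
-- Sort on the pivot table's PK instead of logs.date. With equality filters on
-- (model_id, model_type), InnoDB's implicit PK suffix on the secondary index
-- can serve the last 50 entries without a filesort.
SELECT l.*, lp.model_id, lp.model_type
  FROM logs_pivot lp
  JOIN logs l ON l.id = lp.log_id
 WHERE lp.model_id = 'some_id'
   AND lp.model_type = 'My\Class'
 ORDER BY lp.id DESC
 LIMIT 50;

-- Count done on the pivot table alone (no join) when not filtering by date
SELECT COUNT(*)
  FROM logs_pivot
 WHERE model_id = 'some_id'
   AND model_type = 'My\Class';
```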

Thank you all!

I see two problems:

  • UUIDs are costly when tables are huge relative to RAM size.

  • The LIMIT cannot be handled optimally, because the WHERE clauses come from one table but the ORDER BY column comes from another table. That is, the server will do all of the JOIN, then sort, and finally peel off a few rows.

Just a suggestion. Using a compound index is obviously a good thing. Another option might be to pre-qualify an ID by date, and extend your logs_pivot index to (model_id, model_type, log_id).

If the entire history is 20+ million records, how far back does the data you actually query go, given that you only fetch at most 50 records per given model id/type? Say 3 months, versus a 5-year log? (Not stated in the post, just a for-instance.) If you can query the minimum log ID where the date is greater than, say, 3 months back, that one ID can limit how much of your logs_pivot table gets scanned.

Something like:

select
      lp.*,
      l.date
   from
      logs_pivot lp
         JOIN Logs l
            on lp.log_id = l.id
   where
          model_id = 'some_id' 
      and model_type = 'My\Class'
      and log_id >= ( select min( id )
                         from logs
                        where date >= date_sub( curdate(), interval 3 month ))
   order by 
      l.date desc
   limit  
      50;

So, the WHERE clause for log_id is computed once and returns just one ID from as far back as 3 months, rather than scanning the entire history of logs_pivot. Then you query with the optimized two-part key of model id/type, but also jump to the relevant end of that index, with the ID included in the index key, to skip over all the historical rows.

Another thing you MAY want to add is some pre-aggregate tables of record counts, such as per month/year per given model type/id. Use that as a pre-query to present to users, then use it as a drill-down to get more detail. A pre-aggregate table can be built over all the historical data once, since it is static and does not change. The only part you would have to keep updating is the current month, for example on a nightly basis. Or, possibly better, via a trigger that either inserts a record every time an add is done, or updates the count for the given model/type based on year/month aggregation. Again, just a suggestion, as there is no other context on how/why the data will be presented to the end user.
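A minimal sketch of such a pre-aggregate table and its nightly refresh (table and column names are made up for illustration):

```sql
CREATE TABLE logs_monthly_counts (
  model_id   varchar(40)  NOT NULL,
  model_type varchar(50)  NOT NULL,
  yr_mon     char(7)      NOT NULL,  -- e.g. '2020-10'
  cnt        int unsigned NOT NULL,
  PRIMARY KEY (model_id, model_type, yr_mon)
) ENGINE=InnoDB;

-- Nightly: rebuild only the current month's counts
REPLACE INTO logs_monthly_counts (model_id, model_type, yr_mon, cnt)
SELECT lp.model_id,
       lp.model_type,
       DATE_FORMAT(l.date, '%Y-%m'),
       COUNT(*)
  FROM logs_pivot lp
  JOIN logs l ON l.id = lp.log_id
 WHERE l.date >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
 GROUP BY lp.model_id, lp.model_type, DATE_FORMAT(l.date, '%Y-%m');
```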

SELECT columns FROM big table ORDER BY something LIMIT small number is a notorious query-performance antipattern. Why? The server sorts a whole mess of long rows, then discards almost all of them. It doesn't help that one of your columns is a LOB -- a TEXT column.

Here's an approach that can reduce that overhead: figure out which rows you want by finding the set of primary keys you want, then fetch the content of only those rows.

What rows do you want? This subquery finds them.

                  SELECT logs.id
                    FROM logs
                    JOIN logs_pivot 
                            ON logs.id = logs_pivot.log_id
                   WHERE logs_pivot.model_id = 'some_id'
                     AND logs_pivot.model_type = 'My\Class'
                   ORDER BY logs.date DESC
                   LIMIT 50

This does all the heavy lifting of working out the rows you want. So, this is the query you need to optimize.

It can be accelerated by this index on logs:

CREATE INDEX logs_date_desc ON logs (date DESC);

and this three-column compound index on logs_pivot:

CREATE INDEX logs_pivot_lookup ON logs_pivot (model_id, model_type, log_id);

This index is likely to be better, since the optimizer will see the filtering on logs_pivot but not on logs. Hence, it will look in logs_pivot first.

Or maybe:

CREATE INDEX logs_pivot_lookup ON logs_pivot (log_id, model_id, model_type);

Try one, then the other, to see which yields faster results. (I'm not sure how the JOIN will use the compound index.) (Or simply add both, and use EXPLAIN to see which one it uses.)

Then, when you're happy -- or satisfied anyway -- with the subquery's performance, use it to grab the rows you need, like this:

SELECT * 
  FROM logs
  WHERE id IN (
                  SELECT logs.id
                    FROM logs
                    JOIN logs_pivot 
                            ON logs.id = logs_pivot.log_id
                   WHERE logs_pivot.model_id = 'some_id'
                     AND model_type = 'My\Class'
                   ORDER BY logs.date DESC
                   LIMIT 50
              )
  ORDER BY date DESC

This works because it sorts less data. The covering three-column index on logs_pivot will also help.

Notice that both the subquery and the main query have ORDER BY clauses, to make sure the returned detail result set is in the order you need.

Edit Darnit, I've been on MariaDB 10+ and MySQL 8+ so long I forgot about the old limitation (older MySQL does not support LIMIT inside an IN subquery). Try this instead.

SELECT * 
  FROM logs
  JOIN (
                  SELECT logs.id
                    FROM logs
                    JOIN logs_pivot 
                            ON logs.id = logs_pivot.log_id
                   WHERE logs_pivot.model_id = 'some_id'
                     AND model_type = 'My\Class'
                   ORDER BY logs.date DESC
                   LIMIT 50
        ) id_set ON logs.id = id_set.id
  ORDER BY date DESC

Finally, if you know you only care about rows newer than some certain time, you can add something like this to your subquery.

                  AND logs.date >= NOW() - INTERVAL 5 DAY

This will help a lot if you have tonnage of historical data in your table.

Note: the posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit the original source.
