MySQL optimizer index choice for simple select query

I have a table called PendingExpense with a few simple columns but a lot of rows. I'm working on some queries for paginated GET responses, and I'm confused by the MySQL optimizer: it seemingly makes a senseless decision to do a full index scan to satisfy the ORDER BY clause before applying the WHERE clause filter.

This is on MySQL version 8.0.23.

PendingExpense DDL (note: a companyId plus a loginCredentialId is how I identify a user in my schema):

create table PendingExpense
(
    ID                        bigint   auto_increment primary key,
    LOGINCREDENTIALID         int      null,
    COMPANYID                 int      null,
    DATE                      datetime null,
    -- ... other columns that don't pertain, e.g. amount, status, type, state, country, merchant
);

create index IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC
    on PendingExpense (COMPANYID, LOGINCREDENTIALID);

create index IN_PendingExpense_LOGINCREDENTIALID_ASC
    on PendingExpense (LOGINCREDENTIALID);

create index IN_PendingExpense_Date
    on PendingExpense (DATE);

Here are the two queries I'm comparing; they are identical apart from the index hint. The execution plan for each follows immediately below the query:

Query 1 (no hints):

explain analyze select id from PendingExpense
where COMPANYID = 1641 and LOGINCREDENTIALID = 2451
order by date DESC, id DESC
limit 101; -- takes 5.5 seconds
-> Limit: 101 row(s)  (cost=2356102.00 rows=101) (actual time=2292.676..4474.843 rows=101 loops=1)
    -> Filter: ((PendingExpense.LOGINCREDENTIALID = 2451) and (PendingExpense.COMPANYID = 1641))  (cost=2356102.00 rows=105) (actual time=2292.675..4474.818 rows=101 loops=1)
        -> Index scan on PendingExpense using IN_PendingExpense_Date (reverse)  (cost=2356102.00 rows=5660) (actual time=0.088..4371.774 rows=1491859 loops=1)

Query 2 (index hint):

explain analyze select id from PendingExpense use index (IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC)
where COMPANYID = 1641 and LOGINCREDENTIALID = 2451
order by date desc, id desc
limit 101; -- .184 seconds
-> Limit: 101 row(s)  (cost=9722.30 rows=101) (actual time=38.255..38.267 rows=101 loops=1)
    -> Sort: PendingExpense.`DATE` DESC, PendingExpense.ID DESC, limit input to 101 row(s) per chunk  (cost=9722.30 rows=27778) (actual time=38.254..38.259 rows=101 loops=1)
        -> Index lookup on PendingExpense using IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC (COMPANYID=1641, LOGINCREDENTIALID=2451)  (actual time=0.046..35.410 rows=14170 loops=1)

Essentially, I'm confused about why MySQL chooses to do the full index scan first, before filtering on companyId / loginCredentialId, when an index already exists on those two columns. This causes significant inefficiency. I'd much prefer not to have to specify index hints in my code/queries, for cleanliness. I was under the impression that MySQL generally applies the WHERE clause filtering first, especially if an index already exists for it.

Any help / hints / insight would be appreciated here. Thanks!

This composite, covering index should be perfect for that query:

INDEX(COMPANYID, LOGINCREDENTIALID,   -- in either order
      date, id)    -- last, in this order

The first two columns are tested via =, allowing the matching INDEX rows to be found precisely.

The last two columns can then be scanned backward, so the index delivers the rows already in the ORDER BY order and the scan can stop after 101 rows.
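As a rough sketch, the index could be created and checked like this. The index name is made up for illustration, and I have not run this against the data in the question, so treat the expected plan as an assumption: an index lookup on the new index, read in reverse, with no separate Sort step.

```sql
-- Hypothetical index name; the column order follows the INDEX(...) above.
-- `DATE` is backtick-quoted only as a precaution, since it is also a type keyword.
CREATE INDEX IN_PendingExpense_Company_Login_Date_Id
    ON PendingExpense (COMPANYID, LOGINCREDENTIALID, `DATE`, ID);

-- Same query as in the question, with no index hint.
EXPLAIN ANALYZE
SELECT ID
FROM PendingExpense
WHERE COMPANYID = 1641 AND LOGINCREDENTIALID = 2451
ORDER BY `DATE` DESC, ID DESC
LIMIT 101;
```

Both ORDER BY columns sort in the same direction (DESC), which is what lets a single backward scan of an ascending index satisfy the ordering.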

'Covering'

Since all the necessary columns are in the index (hence "covering", aka "Using index"), the data's BTree does not need to be touched.
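To see the difference on the table from the question, here is a small sketch. It assumes the table uses InnoDB (the default engine in MySQL 8.0; the question does not say), where every secondary index also carries the primary key, ID here. The USE INDEX hint is only there to pin both statements to the same index, and the plans described in the comments are what I would expect, not output I have run.

```sql
-- Expected to be covering: COMPANYID and LOGINCREDENTIALID are the indexed
-- columns, and ID rides along as the primary key. Look for "Using index"
-- in the Extra column of the EXPLAIN output.
EXPLAIN
SELECT ID
FROM PendingExpense USE INDEX (IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC)
WHERE COMPANYID = 1641 AND LOGINCREDENTIALID = 2451;

-- Expected not to be covering: `DATE` is not in that index, so every
-- matching index entry needs an extra lookup in the data's BTree.
EXPLAIN
SELECT ID, `DATE`
FROM PendingExpense USE INDEX (IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC)
WHERE COMPANYID = 1641 AND LOGINCREDENTIALID = 2451;
```

The second case is the situation of Query 2 in the question: ordering by DATE means DATE has to be fetched for each of the 14170 matching rows before they can be sorted.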

The entire table lives in a B+Tree; it is ordered by the PRIMARY KEY. Hence, it is efficient to look up a single row or a range of rows based on the PK.

Each "secondary" index is a very similar B+Tree. It contains all the column(s) specified in the index, plus (silently) all the column(s) of the PK. That is, with

PRIMARY KEY(id),  INDEX(foo, bar)

the secondary index is really a B+Tree indexed by (foo, bar, id). When those columns are all that is needed for a SELECT, the index is "covering" and only that B+Tree is looked at. If you need other columns, then id (in this example) is used to reach into the data's BTree to find them.

"Full table scan" or "Full index scan"

If no index (neither the PK nor a secondary index) is useful for locating the requested row(s), the query will do a "full table scan", checking each row for whether it is relevant. Similarly, it may do a "full index scan" when a "covering" index is being used.

Continuing with the example above (and assuming another column x not in any index),

SELECT *        FROM t WHERE id=5;   -- point query
SELECT COUNT(*) FROM t WHERE foo=5;  -- covering
SELECT bar      FROM t WHERE foo=5;  -- covering
SELECT x        FROM t WHERE foo=5;  -- well indexed (but not covering)
SELECT COUNT(*) FROM t WHERE bar=5;  -- full index scan (covering but slow)
SELECT *        FROM t WHERE bar=5;  -- full index scan (plus lookup)
SELECT COUNT(*) FROM t WHERE x=5;    -- full table scan
SELECT *        FROM t WHERE x=5;    -- full table scan

(These examples are ordered, fastest first.)

SELECT COUNT(*) ... returns 1 row. SELECT * ... potentially returns many rows, so it is potentially slower.
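To check which access method a given statement gets, the type column of EXPLAIN is the place to look. The table below is a hypothetical stand-in for t (the column types and the index name foo_bar are my assumptions), and the comments describe what the optimizer typically picks on a populated table; the actual choice depends on table statistics.

```sql
-- Hypothetical table matching the example: PK on id, a secondary index on
-- (foo, bar), and x in no index. Column types are assumptions.
CREATE TABLE t (
    id  BIGINT AUTO_INCREMENT PRIMARY KEY,
    foo INT,
    bar INT,
    x   INT,
    INDEX foo_bar (foo, bar)
);

EXPLAIN SELECT *        FROM t WHERE id = 5;   -- typically type = const (point query on the PK)
EXPLAIN SELECT bar      FROM t WHERE foo = 5;  -- typically type = ref, Extra shows "Using index" (covering)
EXPLAIN SELECT x        FROM t WHERE foo = 5;  -- typically type = ref, without "Using index"
EXPLAIN SELECT COUNT(*) FROM t WHERE bar = 5;  -- typically type = index (full index scan)
EXPLAIN SELECT COUNT(*) FROM t WHERE x = 5;    -- typically type = ALL (full table scan)
```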

Your optimized query would use an index that covers the WHERE clause first, then the ORDER BY. So I would have an index on

( COMPANYID, LOGINCREDENTIALID, DATE, ID )

Company and credentials cover the WHERE clause. Then the date and ID cover the ORDER BY clause.
