MySQL Clustered vs Non Clustered Index Performance

I'm running a couple of tests on MySQL clustered vs non-clustered indexes. I have a table, 100gb_table, which contains ~60 million rows:

100gb_table schema:
CREATE TABLE 100gb_table (
    id int PRIMARY KEY NOT NULL AUTO_INCREMENT,
    c1 int,
    c2 text,
    c3 text,
    c4 blob NOT NULL,
    c5 text,
    c6 text,
    ts timestamp NOT NULL default(CURRENT_TIMESTAMP)
);

I'm executing a query that only reads the clustered index:

SELECT id FROM 100gb_table ORDER BY id;

This query takes ~55 minutes to complete, which is strangely slow. I modified the table by adding another index on the primary key column and ran the following query, which tells MySQL to use the non-clustered index:

SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;

This finished in <10 minutes, much faster than reading with the clustered index. Why is there such a large discrepancy between these two? My understanding is that both indexes store the index column's values in a tree structure, except that the clustered index also contains the table data in its leaf nodes, so I would expect both queries to perform similarly. Could the BLOB column be distorting the clustered index structure?

The answer lies in how the data is laid out.

The PRIMARY KEY is "clustered" with the data; that is, the data is ordered by the PK in a B+Tree structure. To read all of the ids, the entire BTree must be read.

Any secondary index is also in a B+Tree structure, but it contains only (1) the columns of the index, and (2) any columns of the PK that are not already in the index.
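To check which BTree a query actually walks, something like the following may help. It is only a sketch: the question does not show how non_clustered_key was created, so the ALTER TABLE statement is an assumption.

```sql
-- Assumed DDL; the question only names the index, it does not show its definition.
ALTER TABLE 100gb_table ADD INDEX non_clustered_key (id);

-- The "key" column of each plan names the index that will be scanned.
EXPLAIN SELECT id FROM 100gb_table FORCE INDEX (PRIMARY) ORDER BY id;
EXPLAIN SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;
```

Without a hint, the optimizer is free to pick either index, so the plan for the unhinted query is worth checking too.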

In your example (with lots of [presumably] bulky columns), the data BTree is a lot bigger than the secondary index (on just id). Either test probably required reading all the relevant blocks from the disk.
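One way to see the size gap is to look at the page counts InnoDB keeps per index. This assumes persistent statistics are enabled (the default) and that you can read the mysql schema. The numbers are estimates, refreshed by ANALYZE TABLE.

```sql
-- Approximate on-disk size of each index of the table in the current database.
-- stat_name = 'size' is the number of pages in the index.
SELECT index_name,
       stat_value AS pages,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024 / 1024, 2) AS size_gb
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()
  AND table_name = '100gb_table'
  AND stat_name = 'size'
ORDER BY stat_value DESC;
```

The PRIMARY row is the data BTree; the other rows are the secondary indexes.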

A side note... This is not as bad as it could be. There is a limit of about 8KB on how big a row can be. TEXT and BLOB columns, when short enough, are included in that 8KB. But when one is bulky, it is put in another place, leaving behind a 'pointer' to the text/blob. Hence, the main part of the data BTree is smaller than it might be if all the text/blob data were included directly.

Since SELECT id FROM tbl is a mostly unnecessary query, the design of InnoDB does not worry about the inefficiency you discovered.

Tack on ORDER BY or WHERE, etc., and there are many different optimizations that could come into play. You might even find that INDEX(c1) will let your query run in not much more than 10 minutes. (I think I have given you all the clues for 'why'.)
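A rough sketch of that INDEX(c1) experiment, with idx_c1 as a made-up index name: because a secondary index carries the PK columns, an index on c1 also holds id, so it can answer SELECT id without touching the data BTree.

```sql
-- idx_c1 is a hypothetical name, not something from the question.
ALTER TABLE 100gb_table ADD INDEX idx_c1 (c1);

-- Check the "key" column to confirm that idx_c1 is the index being scanned.
EXPLAIN SELECT id FROM 100gb_table FORCE INDEX (idx_c1);
```

Note that the ids come back in c1 order here, not id order; adding ORDER BY id on top of this would need a sort.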

Also, if you had done SELECT * FROM tbl, it might have taken much longer than 55 minutes. This is because of the extra [random] fetches needed to get the texts/blobs from the "off-record" storage, and because of the network time to shovel far more data.
