
Indexed column and not indexed column research

I generated separate MySQL InnoDB tables with 2000, 5000, 10000, 20000, 50000, 100 000 and 200 000 rows (using a PHP loop and INSERT queries). Each table has two columns: id (INT primary key, auto-increment) and number (INT, UNIQUE KEY). Then I did the same again, but this time I generated similar tables where the number column has no index. I generated the data so that the value of number equals the row's position + 2: the first row has number = 3, the 1000th row has 1002, and so on. I wanted to test a query like the following, because it will be used in my application:

SELECT count(number) FROM number_two_hundred_I WHERE number=200002;

After generating the data for these tables, I wanted to measure the time of the worst-case queries. I used SHOW PROFILES for this. I assumed that the worst-case query is the one that looks up the largest value of number in each table (2002, 5002, and so on), so here are the queries that I tested and their times (as reported by SHOW PROFILES):

SELECT count(number) FROM number_two_thousand_I WHERE number=2002;
-- Tables whose number column is indexed have the suffix _I at the end
-- of the table name. Time for this query: 0.00099250
SELECT count(number) FROM number_two_thousand WHERE number=2002;
-- Without the _I suffix, the number column is not indexed.
-- Time for this query: 0.00226275
SELECT count(number) FROM number_five_thousand_I WHERE number=5002;
-- 0.00095600
SELECT count(number) FROM number_five_thousand WHERE number=5002;
-- 0.00404125

So here are the results:

  1. 2000 rows: indexed 0.00099250, not indexed 0.00226275
  2. 5000 rows: indexed 0.00095600, not indexed 0.00404125
  3. 10000 rows: indexed 0.00156900, not indexed 0.00761750
  4. 20000 rows: indexed 0.00155850, not indexed 0.01452820
  5. 50000 rows: indexed 0.00051100, not indexed 0.04127450
  6. 100000 rows: indexed 0.00121750, not indexed 0.07120075
  7. 200000 rows: indexed 0.00095025, not indexed 0.11406950

Here is a chart of these results. It shows how the worst-case query time depends on the number of rows for the indexed and the not indexed column; indexed is shown in red. When I tested speed, I ran each query twice in the mysql console, because I noticed that on the first run the query on the not indexed column can sometimes be even a bit faster than the one on the indexed column. The question is: why does this type of query on 200000 rows sometimes take less time than the same query on 100000 rows when the number column is indexed? There are other results that I find unpredictable as well. I ask because when the number column is not indexed, the results are quite predictable: the time for 200000 rows is always larger than for 100000. Please tell me what I'm doing wrong in my research on the UNIQUE indexed column.

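Without an index it is always a full table scan, so the time tracks the number of rows closely. With an index you are measuring the index lookup time, which is effectively constant in your case (small numbers, small deviations).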

It is not the "worst" case.

  • Make the UNIQUE key random instead of being in lock step with the PK. An example of such a key is UUID().
  • Generate enough rows so that the table and index(es) cannot fit in the buffer_pool.

If you do both of those, you will eventually see performance slow down significantly.
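A minimal sketch of the first suggestion, written in Python rather than the PHP loop mentioned in the question. It produces the same unique values for the number column, but in random order, so they no longer follow the auto-increment id. The function name random_unique_numbers and its parameters are illustrative assumptions, and the rows would still have to be inserted into MySQL separately.

```python
import random
import uuid


def random_unique_numbers(n, start=3, seed=None):
    """Return the n values start..start+n-1 in random order (all unique)."""
    rng = random.Random(seed)
    values = list(range(start, start + n))
    rng.shuffle(values)
    return values


if __name__ == "__main__":
    numbers = random_unique_numbers(10, seed=1)
    print(numbers)            # same set of values, but not in id order
    print(str(uuid.uuid4()))  # a random key in the same textual format as a UUID
```

Shuffling a fixed range keeps the values unique without any retry logic, which is why it is used here instead of drawing random integers one by one.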

UNIQUE keys have the following impact on INSERTs: the uniqueness constraint is checked before returning to the client. For a non-UNIQUE index, the work to insert into the index's BTree can be (and is) delayed (cf. "Change buffer"). With no index on the second column, there is even less work to do.

WHERE number=2002 --

  • With UNIQUE(number) -- Drill down the BTree. Very fast, very efficient.
  • With INDEX(number) -- Drill down the BTree. Very fast, very efficient. However, it is slightly slower since it can't assume there is only one such row. That is, after finding the right spot in the BTree, it will scan forward (very efficient) until it finds a value other than 2002.
  • With no index on number -- Scan the entire table. So the cost depends on the table size, not on the value of number. It has no clue whether 2002 exists anywhere in the table, or how many times. If you plot the times you got, you will see that they are rather linear.
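The three cases can be told apart by looking at the query plan. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, because it needs no server. SQLite is not InnoDB and its planner output differs from MySQL's EXPLAIN, so treat this only as an illustration of "index search" versus "full scan". The table names t_unique, t_index and t_plain are hypothetical.

```python
import sqlite3


def plan_for(conn, table):
    """Return SQLite's query-plan text for the lookup used in the question."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT count(number) FROM {table} WHERE number = ?",
        (2002,),
    ).fetchall()
    # The last column of each plan row is the human-readable detail string.
    return " | ".join(row[-1] for row in rows)


def build_demo():
    """Create three small tables: UNIQUE, plain INDEX, and no index on number."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t_unique (id INTEGER PRIMARY KEY, number INT UNIQUE)")
    conn.execute("CREATE TABLE t_index (id INTEGER PRIMARY KEY, number INT)")
    conn.execute("CREATE INDEX idx_number ON t_index (number)")
    conn.execute("CREATE TABLE t_plain (id INTEGER PRIMARY KEY, number INT)")
    for table in ("t_unique", "t_index", "t_plain"):
        # Same data layout as in the question: number = position + 2.
        conn.executemany(
            f"INSERT INTO {table} (number) VALUES (?)",
            ((i + 2,) for i in range(1, 2001)),
        )
    return conn


if __name__ == "__main__":
    conn = build_demo()
    for table in ("t_unique", "t_index", "t_plain"):
        print(table, "->", plan_for(conn, table))
```

On MySQL itself, the corresponding check would be to run EXPLAIN on the SELECT for each table and compare the access type and the estimated number of rows.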

I suggest you use log-log 'paper' for your graph. Anyway, note how linear the non-indexed case is, while the indexed case is essentially constant. Finding number=200002 is just as cheap as finding number=2002. This applies to both UNIQUE and INDEX. (Actually, there is a very slight rise in the line because a BTree is really O(log n), not O(1). For 2K rows, there are probably 2 levels in the BTree; for 200K, 3 levels.)
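As a rough numerical counterpart to log-log paper, the slope of log(time) against log(rows) can be computed directly from the timings posted in the question. A slope near 1 indicates roughly linear growth, a slope near 0 indicates roughly constant time. This is only a sketch: the helper name loglog_slope is made up for illustration, and seven noisy measurements do not support more than a qualitative reading.

```python
import math

# Timings copied from the question (as reported by SHOW PROFILES).
ROWS = [2000, 5000, 10000, 20000, 50000, 100000, 200000]
INDEXED = [0.00099250, 0.00095600, 0.00156900, 0.00155850,
           0.00051100, 0.00121750, 0.00095025]
NOT_INDEXED = [0.00226275, 0.00404125, 0.00761750, 0.01452820,
               0.04127450, 0.07120075, 0.11406950]


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


if __name__ == "__main__":
    print("not indexed slope:", round(loglog_slope(ROWS, NOT_INDEXED), 2))
    print("indexed slope:    ", round(loglog_slope(ROWS, INDEXED), 2))
```

With the posted numbers, the non-indexed slope comes out somewhat below 1 and the indexed slope close to 0, which matches the linear versus constant picture described above.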

The Query cache can trip you up in timings (if it is turned on). When timing, do SELECT SQL_NO_CACHE ... to avoid the QC. If the QC is on and applies, then the second and subsequent runs of the identical query will take very close to 0.000 seconds.

Those timings that varied between 0.5ms and 1.2ms -- chalk it up to the phase of the moon. Seriously, any timing below 10ms should not be trusted. This is because of all the other things that may be happening on the computer at the same time. You can temper it somewhat by averaging multiple runs -- being sure to avoid (1) the Query cache, and (2) I/O.

As for I/O... This gets back to my earlier comment about what may happen when the table (and/or index) is bigger than can be cached in RAM.

  • When smaller than RAM, the first run is likely to fetch stuff from disk. The second and subsequent runs are likely to be faster and more consistent.
  • When bigger than RAM, all runs may need to hit the disk. Hence, all may be slow, and perhaps more flaky than the variations you found.

Your tags are, technically, incorrect. Most of MySQL's indexes are BTrees (actually B+Trees), not Binary Trees. (Sure, there is a lot of similarity, and many of the principles are shared.)

Back to your research goal.

  • Assume there is 'background noise' that is messing with your figures.
  • Make your tests non-trivial (e.g. the non-indexed case) so that they overwhelm the noise, or
  • Repeat the timings to mask the issue. And be sure to ignore the first run.
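The "repeat and ignore the first run" advice can be sketched as a small timing harness. The helper names time_runs and summarize are hypothetical, and the workload in the demo is an in-memory SQLite table used only as a stand-in; for real measurements the callable would run the query against MySQL through whatever client library is in use.

```python
import sqlite3
import statistics
import time


def time_runs(run_query, repeats=20):
    """Call run_query repeats+1 times, drop the first (cold) run, return the rest."""
    samples = []
    for _ in range(repeats + 1):
        started = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - started)
    return samples[1:]  # the first run may include cache warm-up


def summarize(samples):
    """Median and range are less sensitive to one noisy run than a single timing."""
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload: an in-memory SQLite table without an index on number.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, number INT)")
    conn.executemany("INSERT INTO t (number) VALUES (?)",
                     ((i + 2,) for i in range(1, 200001)))
    query = "SELECT count(number) FROM t WHERE number = 200002"
    print(summarize(time_runs(lambda: conn.execute(query).fetchall())))
```

Reporting the median together with the minimum and maximum makes it visible when the spread between runs is as large as the difference being measured, which is the situation with the sub-millisecond indexed timings.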

The main cost in performing any SELECT is how many rows it touches.

  • With your UNIQUE index, it is touching 1 row. So expect fast and O(1) (plus noise).
  • Without an index, it is touching N rows for an N-row table. So expect O(N).
