
MySQL indexing columns vs joining tables

I am trying to figure out the most efficient way to extract values from a database with a structure similar to this:

table test:

create table test (
  id int primary key auto_increment,
  stuff varchar(50),
  important_stuff varchar(50)
);

where I need to do a query like

select * from test where important_stuff like 'prefix%';

The entire table is approximately 10 million rows, but there are only about 500-1000 distinct values for important_stuff. My current solution is an index on important_stuff, but performance is not satisfactory. Would it be better to create a separate table that maps each distinct important_stuff value to an id, store that id in the test table, and then do

select b.* from test b join (select id from stuff_lookup where important_stuff like 'prefix%') a on b.stuff_id = a.id

or this:

select * from test where stuff_id in (select id from stuff_lookup where important_stuff like 'prefix%')

What is the best way to optimize a query like this?
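For reference, the normalized variant being considered could be sketched like this. The table and column names (stuff_lookup, stuff_id) are taken from the question; the exact DDL is an assumption about what that design would look like:

```sql
-- Hypothetical lookup table: one row per distinct important_stuff value
-- (only ~500-1000 rows, so its index is tiny).
create table stuff_lookup (
  id int primary key auto_increment,
  important_stuff varchar(50) not null,
  index (important_stuff)
);

create table test (
  id int primary key auto_increment,
  stuff varchar(50),
  stuff_id int not null,
  index (stuff_id),  -- needed for the join back from stuff_lookup
  foreign key (stuff_id) references stuff_lookup(id)
);

-- The prefix search then becomes a join through the small table:
select t.*
from stuff_lookup s
join test t on t.stuff_id = s.id
where s.important_stuff like 'prefix%';
```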

I'm not a MySQL user, but I ran some tests on my local database. I added 10 million rows, as you described, and the distinct values from the third column load quite fast. These are my results.

mysql> describe bigtable;
+-----------------+-------------+------+-----+---------+----------------+
| Field           | Type        | Null | Key | Default | Extra          |
+-----------------+-------------+------+-----+---------+----------------+
| id              | int(11)     | NO   | PRI | NULL    | auto_increment |
| stuff           | varchar(50) | NO   |     | NULL    |                |
| important_stuff | varchar(50) | NO   | MUL | NULL    |                |
+-----------------+-------------+------+-----+---------+----------------+
3 rows in set (0.03 sec)

mysql> select count(*) from bigtable;
+----------+
| count(*) |
+----------+
| 10000089 |
+----------+
1 row in set (2.87 sec)

mysql> select count(distinct important_stuff) from bigtable;
+---------------------------------+
| count(distinct important_stuff) |
+---------------------------------+
|                            1000 |
+---------------------------------+
1 row in set (0.01 sec)

mysql> select distinct important_stuff from bigtable;
....
| is_987          |
| is_988          |
| is_989          |
| is_99           |
| is_990          |
| is_991          |
| is_992          |
| is_993          |
| is_994          |
| is_995          |
| is_996          |
| is_997          |
| is_998          |
| is_999          |
+-----------------+
1000 rows in set (0.15 sec)

The important detail is that I refreshed the statistics on this table (before this operation, loading these values took ~10 seconds):

mysql> optimize table bigtable;

How big is innodb_buffer_pool_size? How much RAM is available? The former should be about 70% of the latter. You'll see in a minute why I bring up this setting.
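As a concrete illustration of that rule of thumb, assuming MySQL 5.7+ (where the buffer pool can be resized online) and a dedicated server with 16GB of RAM; the 11G figure is just ~70% of that and should be adjusted to your machine:

```sql
-- Check the current setting (value is in bytes):
show variables like 'innodb_buffer_pool_size';

-- ~70% of 16GB of RAM is roughly 11GB:
set global innodb_buffer_pool_size = 11 * 1024 * 1024 * 1024;
```

On older versions, or to make the change permanent, set the same value under [mysqld] in the server's configuration file and restart.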

Based on your 3 suggested SELECTs, the original one will work as well as the two complex ones. In some other case, a complex formulation might work better.

INDEX(important_stuff) is the 'best' index for

select * from test where important_stuff like 'prefix%';

Now, let's study how that query works with that index:

  1. Reach into the BTree index, starting at 'prefix'. (Effort: virtually instantaneous)
  2. Scan forward for, say, 1000 entries. That will be about 10 InnoDB blocks (16KB each). Each entry will have the PRIMARY KEY (id). (Effort: <= 10 disk hits)
  3. For each entry, look up the row (so you can get "*"). That's 1000 PK lookups in the BTree that contains both the PK and the data. At best, they might all be in 10 blocks; at worst, they could be in 1000 separate blocks. (Effort: 10-1000 blocks)
Total effort: ~1010 blocks (worst case).
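You can check that the optimizer actually takes the path described above with EXPLAIN. The exact output depends on your version and statistics, but with an index on important_stuff you would expect a range access on that index rather than a full table scan:

```sql
explain select * from test where important_stuff like 'prefix%';
-- Typically shows: type=range, key=important_stuff, with "rows" being
-- the optimizer's estimate of how many entries match the prefix.
```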

A standard spinning disk can handle ~100 reads/second, so we are looking at about 10 seconds.

Now, run the query again. Guess what: all those blocks are now in RAM (cached in the buffer_pool, which is hopefully big enough for all of them), and it runs in less than 1 second.

OPTIMIZE TABLE was not necessary! It was not a statistics refresh but caching that sped up the query.


 