MySQL indexing columns vs joining tables

I am trying to figure out the most efficient way to extract values from a database table with a structure similar to this:

create table test (
  id int primary key auto_increment,
  stuff varchar(50),
  important_stuff varchar(50)
);

where I need to do a query like

select * from test where important_stuff like 'prefix%';

The table has approximately 10 million rows, but there are only about 500-1000 distinct values of important_stuff. My current solution is an index on important_stuff, but the performance is not satisfactory. Would it be better to create a separate table that maps each distinct important_stuff value to an id, store that id in the 'test' table as stuff_id, and then do

select b.* from test b join stuff_lookup a on b.stuff_id = a.id where a.important_stuff like 'prefix%';

or this:

select * from test where stuff_id in (select id from stuff_lookup where important_stuff like 'prefix%');

What is the best way to optimize things like that?

I'm not a MySQL user, but I ran some tests on my local database. I added 10 million rows, as you described, and the distinct values from the third column load quite fast. These are my results.

mysql> describe bigtable;
+-----------------+-------------+------+-----+---------+----------------+
| Field           | Type        | Null | Key | Default | Extra          |
+-----------------+-------------+------+-----+---------+----------------+
| id              | int(11)     | NO   | PRI | NULL    | auto_increment |
| stuff           | varchar(50) | NO   |     | NULL    |                |
| important_stuff | varchar(50) | NO   | MUL | NULL    |                |
+-----------------+-------------+------+-----+---------+----------------+
3 rows in set (0.03 sec)

mysql> select count(*) from bigtable;
+----------+
| count(*) |
+----------+
| 10000089 |
+----------+
1 row in set (2.87 sec)

mysql> select count(distinct important_stuff) from bigtable;
+---------------------------------+
| count(distinct important_stuff) |
+---------------------------------+
|                            1000 |
+---------------------------------+
1 row in set (0.01 sec)

mysql> select distinct important_stuff from bigtable;
....
| is_987          |
| is_988          |
| is_989          |
| is_99           |
| is_990          |
| is_991          |
| is_992          |
| is_993          |
| is_994          |
| is_995          |
| is_996          |
| is_997          |
| is_998          |
| is_999          |
+-----------------+
1000 rows in set (0.15 sec)

One important detail: I refreshed the statistics on this table first (before this operation, loading this data took ~10 seconds).

mysql> optimize table bigtable;

How big is innodb_buffer_pool_size? How much RAM is available? The former should be about 70% of the latter. You'll see in a minute why I bring up this setting.
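To make the 70% rule of thumb concrete, here is a minimal sketch of the arithmetic. The 16 GiB machine is a hypothetical example, and the helper name `suggested_buffer_pool_bytes` is made up for illustration; the only input taken from the text above is the 70% figure.

```python
GIB = 1024 ** 3

def suggested_buffer_pool_bytes(ram_bytes: int, percent: int = 70) -> int:
    """Rough innodb_buffer_pool_size target: a fixed percentage of total RAM."""
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return ram_bytes * percent // 100

ram = 16 * GIB  # assumption: a server with 16 GiB of RAM
pool = suggested_buffer_pool_bytes(ram)
print(f"RAM: {ram / GIB:.1f} GiB -> buffer pool: {pool / GIB:.1f} GiB")
```

The result is only a starting point; a server that also runs other memory-hungry processes would likely need a smaller share.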

Of your 3 suggested SELECTs, the original one will work as well as the two more complex ones. In some other cases, a more complex formulation might work better.
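As a sanity check that the three formulations select the same rows, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. It says nothing about MySQL performance; it only compares results. The `stuff_lookup` table and `stuff_id` column follow the question, while the sample values and index names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stuff_lookup (
        id INTEGER PRIMARY KEY,
        important_stuff VARCHAR(50) NOT NULL UNIQUE
    );
    CREATE TABLE test (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        stuff VARCHAR(50) NOT NULL,
        important_stuff VARCHAR(50) NOT NULL,
        stuff_id INTEGER NOT NULL REFERENCES stuff_lookup(id)
    );
    CREATE INDEX idx_important_stuff ON test (important_stuff);
    CREATE INDEX idx_stuff_id ON test (stuff_id);
""")

# Invented sample data: three distinct values, two of which match 'prefix%'.
values = ["prefix_a", "prefix_b", "other_c"]
conn.executemany(
    "INSERT INTO stuff_lookup (id, important_stuff) VALUES (?, ?)",
    list(enumerate(values, start=1)),
)
conn.executemany(
    "INSERT INTO test (stuff, important_stuff, stuff_id) VALUES (?, ?, ?)",
    [(f"row{i}", values[i % 3], i % 3 + 1) for i in range(30)],
)

def ids(sql: str) -> list:
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

# 1. The original query against the indexed column.
direct = ids("SELECT id FROM test WHERE important_stuff LIKE 'prefix%' ORDER BY id")

# 2. Join through the lookup table.
joined = ids("""
    SELECT b.id FROM test b
    JOIN stuff_lookup a ON b.stuff_id = a.id
    WHERE a.important_stuff LIKE 'prefix%'
    ORDER BY b.id
""")

# 3. IN (subquery) against the lookup table.
subquery = ids("""
    SELECT id FROM test
    WHERE stuff_id IN (SELECT id FROM stuff_lookup
                       WHERE important_stuff LIKE 'prefix%')
    ORDER BY id
""")

print(direct == joined == subquery, len(direct))
```

To compare the formulations on a real MySQL server, running each one under EXPLAIN is the more informative test.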

INDEX(important_stuff) is the 'best' index for

select * from test where important_stuff like 'prefix%';

Now, let's study how that query works with that index:

  1. Reach into the BTree index, starting at 'prefix'. (Effort: Virtually instantaneous)
  2. Scan forward for, say, 1000 entries. That will be about 10 InnoDB blocks (16KB each). Each entry will have the PRIMARY KEY ( id ). (Effort: <= 10 disk hits)
  3. For each entry, look up the row (so you can get "*"). That's 1000 PK lookups in the BTree that contains both the PK and the data. At best, they might all be in 10 blocks. At worst, they could be in 1000 separate blocks. (Effort: 10-1000 blocks)

Total Effort: ~1010 blocks (worst case).

A standard spinning disk can handle ~100 reads/second, so we are looking at about 10 seconds.
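The estimate above can be written out as a short calculation. This is a back-of-the-envelope sketch, not a model of InnoDB: the figures of roughly 100 index entries per block and, in the best case, 100 rows per data block are assumptions derived from "1000 entries in about 10 blocks", and the function name is made up for illustration.

```python
from math import ceil

def estimate_cold_query(matching_rows: int,
                        entries_per_index_block: int = 100,  # assumption: 1000 entries ~ 10 blocks
                        rows_per_data_block: int = 100,      # assumption: best-case row clustering
                        reads_per_second: int = 100) -> dict:
    """Estimate block reads and seconds for an uncached index range scan plus PK lookups."""
    index_blocks = ceil(matching_rows / entries_per_index_block)
    best_data_blocks = ceil(matching_rows / rows_per_data_block)  # rows packed together
    worst_data_blocks = matching_rows                             # every row in its own block
    best = index_blocks + best_data_blocks
    worst = index_blocks + worst_data_blocks
    return {
        "best_blocks": best,
        "worst_blocks": worst,
        "best_seconds": best / reads_per_second,
        "worst_seconds": worst / reads_per_second,
    }

estimate = estimate_cold_query(1000)
print(estimate["worst_blocks"], estimate["worst_seconds"])  # 1010 10.1
```

With the same assumptions, the best case comes to 20 blocks, or about 0.2 seconds, which shows how much the result depends on how the matching rows are scattered across the table.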

Now, run the query again. Guess what; all those blocks are now in RAM (cached in the "buffer_pool", which is hopefully big enough for all of them). And it runs in less than 1 second.

OPTIMIZE TABLE was not necessary! It was not a statistics refresh, but rather caching that sped up the query.
