
How to properly index and choose the best primary key for a MySQL InnoDB table

This is my first time working with big MySQL tables, and I have a couple of questions about search speed.

I have a MySQL table with 100 million entries. The table currently looks like this:

+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Accession | char(10)     | NO   | PRI | NULL    |       |
| DB        | char(6)      | NO   |     | NULL    |       |
| Organism  | varchar(255) | NO   |     | NULL    |       |
| Gene      | varchar(255) | NO   |     | NULL    |       |
| Name      | varchar(255) | NO   |     | NULL    |       |
| Header    | text         | NO   |     | NULL    |       |
| Sequence  | text         | NO   |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+

with indexes like this:

+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| uniprot |          0 | PRIMARY    |            1 | Accession   | A         |    94275840 |     NULL | NULL   |      | BTREE      |         |               |
| uniprot |          1 | main_index |            1 | Accession   | A         |    94275840 |     NULL | NULL   |      | BTREE      |         |               |
| uniprot |          1 | main_index |            2 | DB          | A         |    94275840 |     NULL | NULL   |      | BTREE      |         |               |
| uniprot |          1 | main_index |            3 | Organism    | A         |    94275840 |      191 | NULL   |      | BTREE      |         |               |
| uniprot |          1 | main_index |            4 | Gene        | A         |    94275840 |      191 | NULL   |      | BTREE      |         |               |
| uniprot |          1 | main_index |            5 | Name        | A         |    94275840 |      191 | NULL   |      | BTREE      |         |               |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

My question is about the efficiency of this. The searches I run are very simple, but I need the answer really fast. 80% of the time I query by Accession and want the Sequence back.

select sequence from uniprot where accession="q32p44";
...
1 row in set (0.06 sec)

10% of the time I search by Gene, and 10% of the time by Organism.

The table is unique on "Accession".

My questions are:

Can I make this table more efficient (search-time wise) in any way?

Is the indexing good?

Do I speed up the search time by making a multi-column primary key like (Accession, Gene, Organism)?

Thanks a lot!

EDIT1:

As requested in the comments:

mysql> show create table uniprot;
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table   | Create Table                                                                                                                                                                                                                                                                                                                                                                                    |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uniprot | CREATE TABLE `uniprot` (
  `Accession` char(10) NOT NULL,
  `DB` char(6) NOT NULL,
  `Organism` varchar(255) NOT NULL,
  `Gene` varchar(255) NOT NULL,
  `Name` varchar(255) NOT NULL,
  `Header` text NOT NULL,
  `Sequence` text NOT NULL,
  PRIMARY KEY (`Accession`),
  KEY `main_index`              (`Accession`,`DB`,`Organism`(191),`Gene`(191),`Name`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Don't use "prefix" indexing; it almost never does as well as you might expect.

CHAR(10) with utf8mb4 means that you are always taking 40 bytes. accession="q32p44" implies that VARCHAR and ascii would be better. With those changes, I would not bother switching to a 'surrogate' key. Consider the same issue for DB.
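For example, a change along these lines would shrink both columns (a sketch only; it assumes every Accession and DB value really is plain ASCII, which should be verified first):

ALTER TABLE uniprot
  MODIFY `Accession` VARCHAR(10) CHARACTER SET ascii NOT NULL,  -- was CHAR(10) utf8mb4 = 40 bytes fixed
  MODIFY `DB`        VARCHAR(6)  CHARACTER SET ascii NOT NULL;  -- same reasoning as Accession

On a 100M-row table this rewrites the whole table, so schedule it like any other large ALTER.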

With PRIMARY KEY(Accession) and InnoDB, there is no advantage in having KEY main_index (Accession, ...). Drop that KEY.
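That is (a sketch, using the index name from the SHOW CREATE TABLE output above):

ALTER TABLE uniprot DROP INDEX main_index;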

What is Sequence? If it is a text string with only 4 different letters, then it should be highly compressible. And, with 100M rows, shrinking the disk footprint could lead to a noticeable speedup. I would COMPRESS it in the client and store it into a BLOB.
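The recommendation is to compress in the client; purely to illustrate the round trip, MySQL's built-in COMPRESS()/UNCOMPRESS() functions do essentially the same thing, and the Sequence_z BLOB column below is a hypothetical name:

-- Sketch only: in practice compress/decompress in the client, and populate
-- the new column in batches rather than with one huge UPDATE.
ALTER TABLE uniprot ADD COLUMN Sequence_z BLOB;
UPDATE uniprot SET Sequence_z = COMPRESS(Sequence);
SELECT UNCOMPRESS(Sequence_z) AS Sequence FROM uniprot WHERE Accession = 'q32p44';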

Do you really need 255 in varchar(255)? Please shrink it to something 'reasonable' for the data. That way, we can reconsider what index(es) to add without using prefixing.

select sequence from uniprot where accession="q32p44";

works very efficiently with PRIMARY KEY(accession)

select sequence from uniprot where accession="q32p44" AND gene = '...';

also works efficiently with that PK. It will find the one row for q32p44 and then simply check that gene matches; then deliver 0 or 1 row.

select sequence from uniprot where gene = '...';

would benefit from INDEX(gene). Similarly for Organism.
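A sketch of adding both, assuming the columns have been shrunk enough (per the earlier point) that no prefixing is needed:

ALTER TABLE uniprot ADD INDEX (Gene), ADD INDEX (Organism);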

How big is the table (in GB)? What is the value of innodb_buffer_pool_size? How much RAM do you have? If the table is a lot bigger than the buffer pool, a random "point query" (WHERE accession = constant) will typically take one disk hit. To discuss other queries, please show us the SELECT.
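Those numbers can be pulled with standard queries, for example (the information_schema figure is approximate):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

SELECT ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS total_gb
FROM   information_schema.tables
WHERE  table_schema = DATABASE()
  AND  table_name = 'uniprot';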

Edit

With 100M rows, shrinking the disk footprint is important for performance. There are multiple ways to do it. I want to focus on (1) shrinking the size of each column and (2) avoiding implicit overhead in indexes.

Each secondary key implicitly includes the PRIMARY KEY . So, if there are 3 indexes, there are 3 copies of the PK. That means that the size of the PK is especially important.

I'm recommending something like

CREATE TABLE `uniprot` (
  `Accession` VARCHAR(10) CHARACTER SET ascii NOT NULL,
  `DB` VARCHAR(6) NOT NULL,
  `Organism` varchar(100) NOT NULL,
  `Gene` varchar(100) NOT NULL,
  `Name` varchar(100) NOT NULL,
  `Header` text NOT NULL,
  `Sequence` text NOT NULL,
  PRIMARY KEY (`Accession`),
  INDEX(Gene),   -- implicitly (Gene, Accession)
  INDEX(Organism)  -- implicitly (Organism, Accession)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

And your main queries are

SELECT Sequence FROM uniprot WHERE Accession = '...';
SELECT Sequence FROM uniprot WHERE Gene = '...';
SELECT Sequence FROM uniprot WHERE Organism = '...';

If Accession is really variable length, shorter than 10, and ascii, then what I suggest brings the total down from 40 bytes * 3 occurrences * 100M rows = 12GB, just for the copies of Accession, to perhaps 2GB. I think a savings of 10GB is worth it. Going to BIGINT would also be about 2GB (no further savings); going to INT would be about 1GB (more savings, but not much).

Shrinking Gene and Organism to 'reasonable' sizes (if practical) avoids the need for prefixing, hence allowing the index to work better. But you could argue that maybe prefixing will work "well enough" in INDEX(Gene(11)). Let's get some numbers to make the argument one way or the other. What is the average length of Gene (and Organism)? How many initial characters of Gene are usually sufficient to identify a gene?
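One way to gather those numbers (a sketch; the 11-character prefix is just the example length from above):

SELECT AVG(CHAR_LENGTH(Gene))         AS avg_gene_len,
       MAX(CHAR_LENGTH(Gene))         AS max_gene_len,
       AVG(CHAR_LENGTH(Organism))     AS avg_org_len,
       COUNT(DISTINCT Gene)           AS distinct_genes,
       COUNT(DISTINCT LEFT(Gene, 11)) AS distinct_gene_prefix11  -- close to distinct_genes means the prefix is "good enough"
FROM   uniprot;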

Another space question is whether there are a lot of duplicates in Gene and/or Organism. If so, then "normalizing" those fields would be warranted. Ditto for Name, Header, and Sequence.

The need for a JOIN (or two) if you make surrogates for Accession and/or Gene is only a slight bit of overhead, not enough to worry about.
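As an illustration of that kind of normalization (a sketch; the organism lookup table and the Organism_id column are hypothetical names):

CREATE TABLE organism (
  id   MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 3 bytes is plenty for a list of organisms
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (name)
) ENGINE=InnoDB;

-- uniprot would then store Organism_id instead of the full string,
-- and the Organism lookup becomes one extra, cheap JOIN:
SELECT u.Sequence
FROM   uniprot  AS u
JOIN   organism AS o  ON o.id = u.Organism_id
WHERE  o.name = 'Homo sapiens';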

First off, as mentioned in the comments, I wouldn't use a natural key (Accession); I would opt for a surrogate key (Id). However, with 100M rows that would be a painful ALTER, during which the table will be locked.

That being said, Accession is already indexed because it is the primary key, so for simple queries like this you can't optimize further:

select sequence from uniprot where accession="q32p44";

If you are doing look-ups against other columns, then your best bet is to add separate indexes for each column:

ALTER TABLE uniprot ADD INDEX (Gene(10)), ADD KEY (Organism(10));

The goal is to index the uniqueness of the values (cardinality), so if you have a lot of values like somethingsomething1, somethingsomething2, somethingsomething3, it would be best to go with a prefix of 18+ characters, but no larger than, say, 30.

Per the MySQL docs:

If names in the column usually differ in the first 10 characters, this index should not be much slower than an index created from the entire name column. Also, using column prefixes for indexes can make the index file much smaller, which could save a lot of disk space and might also speed up INSERT operations.

So the goal is to index the uniqueness (cardinality) but without inflating size on disk.
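A simple way to choose the prefix length (a sketch) is to compare the selectivity of a few candidate prefixes against the full column and take the shortest one that comes close:

SELECT COUNT(DISTINCT Gene)           / COUNT(*) AS full_selectivity,
       COUNT(DISTINCT LEFT(Gene, 10)) / COUNT(*) AS prefix10,
       COUNT(DISTINCT LEFT(Gene, 20)) / COUNT(*) AS prefix20,
       COUNT(DISTINCT LEFT(Gene, 30)) / COUNT(*) AS prefix30
FROM   uniprot;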

I would also remove that main_index index. I don't see the benefit, since you are not searching on all of those columns at the same time, and due to its length it will slow down your writes with little gain on reads.

Be sure to test before you run anything in production. Perhaps take a small sample (1-5% of the dataset) and prefix the queries you plan to run with EXPLAIN to see how MySQL will execute them.
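For example (the Gene value here is just a made-up placeholder):

EXPLAIN SELECT Sequence FROM uniprot WHERE Gene = 'some_gene';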
