简体   繁体   中英

MySql Indexing Strategy With Multiple Shared Columns

We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.

We are going to performance test the star vs snowflake schema (where all of the ids go into a single column, but then an additional column id_type is added to determine which type of identifier it is) for this table, but as long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it since it will make our load process much easier. Before I do that however, I want to make sure the indexes are optimized on the star schema.

create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)

Would it be better to create individual indexes on just the id columns, or also include the metric and timestamp columns in those indexes as well?

Do not normalize "continuous" values, such as DATETIME , FLOAT , INT . Do leave the values in the main table.

When you move the value to other table(s), especially "snowflake", it makes querying based on the values somewhere between a little slower and a lot slower. This especially happens when you need to filter on more than one metric that is not in the main table. Either of these perform very poorly because of "snowflake" or "over-normalization":

WHERE a.x = 123 AND b.y = 345

ORDER BY a.x, b.y

As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs .

INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such things are also inefficient on space.

More

When doing JOINs , the Optimizer will almost always start with one table, then move on to the another table, etc. (See "Nested Loop Join".)

For example (building on the above 'code'), when 2 columns are normalized, and you are testing on the values, you do not have two ids in hand, you only have the two values. This makes the query execution very inefficient. For

SELECT ...
    FROM main
    JOIN a  USING(a_id)
    JOIN b  USING(b_id)
    WHERE a.x = 123 AND b.y = 345

The following is very likely to be the 'execution plan':

  1. Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by by . a needs INDEX(x)
  2. Go back to the main table, looking up rows with those id(s). main needs INDEX(a_id) . Again, more rows than necessary may be hauled around.
  3. Only now, do you get to b (using b_id ) to check for y=345 ; toss the unnecessary rows you have been hauling around. b needs INDEX(b_id)

Note my comment about "haul around". Blindly using * (in SELECT * ) adds to the problem -- all the columns are being hauled around while performing the steps.

On the other hand... If x and y were in the main table, then the code works like:

WHERE main.x = 123
  AND main.y = 345

only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.

In the case of ORDER BY ax, by , it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.

But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.

With a single table, the Optimizer might use an index for WHERE , or it might use an index for ORDER BY , depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.

Another note: If you also have LIMIT 10 ,... If the sort is avoided, then only 10 rows need to be looked at, not the entire set from the WHERE .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM