
Query to show the limitations of a B-Tree index in a MySQL database

I want to demonstrate the limitations of certain index types in MySQL. I've read that a B-Tree index on a column that holds Boolean data is ineffective, because any search using that index, whether for True or for False, ends up performing a full table scan. How can I show this with a SQL query? I tried the query below, but it doesn't prove the claim: if I change the gender in the query to 'F', the index I created gives no advantage. Can someone show me a query that proves a B-Tree index is ineffective on Boolean columns? I'm using the popular Employees database. Note that the emp_no column holds unique values; I brought the gender column in through the WHERE clause. Thanks for any help.

use employees;

CREATE INDEX indx_emp on employees(emp_no);

SELECT 
    *
FROM
    employees USE INDEX (indx_emp)
WHERE
    emp_no BETWEEN 10980 AND 100000
        AND gender = 'M'
ORDER BY birth_date;  

Using an index on boolean columns is just as effective as an index on columns of any other data type.

If you hear someone claim that an index on a boolean is not effective, they're making the assumption that the true and false values each occur on close to 50% of the rows.

In fact, if you have a column that is 90% true and 10% false, then searching an index for the rows with false will benefit from the index. But if you search the index for rows with true, then reading the index is needless overhead. You would have been better off just doing a table-scan.
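The 90/10 case can be reproduced with a small test table. This is only a sketch, assuming MySQL 8.0 or later (it relies on a recursive CTE to generate rows); the table `flag_demo`, its columns, and the index name `idx_is_active` are made up for illustration and are not part of the Employees database. The `payload` column is there so that the index does not cover the whole row, which would otherwise let MySQL answer both queries from the index alone.

```sql
-- Hypothetical demo table: 100,000 rows, is_active is TRUE on about 90% of them.
CREATE TABLE flag_demo (
    id        INT PRIMARY KEY,
    is_active BOOLEAN NOT NULL,
    payload   CHAR(100) NOT NULL DEFAULT '',
    INDEX idx_is_active (is_active)
);

-- The default recursion limit is 1000, so raise it for this session.
SET SESSION cte_max_recursion_depth = 200000;

INSERT INTO flag_demo (id, is_active)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 100000
)
SELECT n, n % 10 <> 0   -- every 10th row is FALSE, the rest are TRUE
FROM seq;

-- Refresh the index statistics the optimizer relies on.
ANALYZE TABLE flag_demo;

-- Rare value: matches about 10% of the rows.
EXPLAIN SELECT * FROM flag_demo WHERE is_active = FALSE;

-- Common value: matches about 90% of the rows.
EXPLAIN SELECT * FROM flag_demo WHERE is_active = TRUE;
```

Compare the `type`, `key`, and `rows` columns of the two plans. What the optimizer actually picks can differ between MySQL versions and storage engines, so treat the output on your server as the evidence rather than any expected result.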

This works the same way for any other data type. For example, if you search for some integer value (or range of values) that matches more than 20% of the rows in the table, the optimizer is likely to skip the index and just do a table-scan. The extra work it would take to read the index, dereference the pointers to rows, and then read the rows, is considered more costly than just scanning all the rows in the table and throwing out those that don't match your condition.

If you search an integer column that has only two distinct values, and each value is found on about 50% of the rows in the table, this is equivalent to searching a boolean column where true and false each occur on about 50% of the rows. The index wouldn't be very selective in either case, so it's likely to be skipped by the optimizer. The same is true for varchar, datetime, or any other indexed column type.
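To see how selective each value actually is, you can count the rows per value. A minimal sketch against the `employees` table from the question; the aliases `row_count` and `pct_of_table` are just illustrative names.

```sql
USE employees;

-- How many rows carry each gender value, and what share of the table that is.
SELECT
    gender,
    COUNT(*) AS row_count,
    ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM employees), 1) AS pct_of_table
FROM
    employees
GROUP BY gender;
```

If both values come back with a large share of the table, neither search is selective, and an index on that column alone is unlikely to help.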

I'm not providing a specific query as you requested, because the query you showed could already demonstrate it. Whether the index is used depends on the data and on how selective the values you are searching for are in that index.

You should analyze the query with EXPLAIN and compare the optimizer's choice and how many rows it estimates it will read.
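As a rough sketch of that comparison on the Employees database: the first statement is the query from the question with the index hint removed, and the rest assume a hypothetical index named `idx_gender` on the two-valued column itself, which the question did not create.

```sql
USE employees;

-- The question's query without USE INDEX, so the optimizer is free to choose.
EXPLAIN
SELECT *
FROM employees
WHERE emp_no BETWEEN 10980 AND 100000
  AND gender = 'M'
ORDER BY birth_date;

-- Hypothetical index on the low-cardinality column.
CREATE INDEX idx_gender ON employees (gender);
ANALYZE TABLE employees;

-- Let the optimizer decide for each value.
EXPLAIN SELECT * FROM employees WHERE gender = 'M';
EXPLAIN SELECT * FROM employees WHERE gender = 'F';

-- Force the index to see the plan the optimizer would be weighing against a scan.
EXPLAIN SELECT * FROM employees FORCE INDEX (idx_gender) WHERE gender = 'F';
```

In the output, `type` = `ALL` means a full table scan and `ref` or `range` means an index lookup, `key` names the index that was chosen (or NULL), and `rows` is the optimizer's estimate of how many rows it will examine.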


The 20% figure is not an official threshold. It's not documented, it's just something I have observed. It might vary based on what data types you are searching, or the implementation might change in some other version of the software.
