
Improve self-JOIN SQL Query performance

I am trying to improve the performance of a SQL query, using MariaDB 10.1.18 (Linux Debian Jessie).

The server has a large amount of RAM (192GB) and SSD disks.

The real table has hundreds of millions of rows but I can reproduce my performance issue on a subset of the data and a simplified layout.

Here is the (simplified) table definition:

CREATE TABLE `data` (
  `uri` varchar(255) NOT NULL,
  `category` tinyint(4) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`uri`,`category`),
  KEY `cvu` (`category`,`value`,`uri`),
  KEY `cu` (`category`,`uri`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

To reproduce the actual distribution of my content, I insert about 200'000 rows like this (bash script):

#!/bin/bash
for i in `seq 1 100000`;
do
  mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 1, 'foo');"
done

for i in `seq 99981 200000`;
do
  mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 2, '$(($i % 5))');"
done

So, we insert about:

  • 100'000 rows in category 1 with a static string ("foo") as value
  • 100'000 rows in category 2 with a number between 0 and 4 (`i % 5`) as the value
  • 20 rows sharing a common "uri" between the two datasets (category 1 / 2)
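Since the bash script above opens a new `mysql` client process per row, regenerating larger datasets is slow. As a sketch (plain Python, assuming the same `data` table and writing to a hypothetical `data.sql` file), the same rows can be emitted as batched multi-row INSERT statements and loaded in one session with `mysql mydb < data.sql`:

```python
# Sketch: generate the same dataset as the bash script above, but as
# multi-row INSERT statements (1000 rows each) written to a file, so the
# whole load happens in a single client session.
CHUNK = 1000

def rows():
    for i in range(1, 100001):       # category 1, constant value "foo"
        yield (f"uri{i}", 1, "foo")
    for i in range(99981, 200001):   # category 2, value = i % 5
        yield (f"uri{i}", 2, str(i % 5))

def write_sql(path="data.sql"):
    statements = 0
    buf = []
    with open(path, "w") as f:
        def flush():
            nonlocal statements
            if buf:
                f.write("INSERT INTO data (uri, category, value) VALUES "
                        + ",".join(buf) + ";\n")
                buf.clear()
                statements += 1
        for uri, cat, val in rows():
            buf.append(f"('{uri}', {cat}, '{val}')")
            if len(buf) == CHUNK:
                flush()
        flush()                      # write the final partial chunk
    return statements
```

With 200'020 rows and 1000 rows per statement this produces 201 INSERT statements instead of 200'020 separate client invocations.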

I always run an ANALYZE TABLE before querying.

Here is the explain output of the query I run:

MariaDB [mydb]> EXPLAIN EXTENDED
    -> SELECT d2.uri, d2.value
    -> FROM data as d1
    -> INNER JOIN data as d2 ON d1.uri  = d2.uri AND d2.category = 2
    -> WHERE d1.category = 1 and d1.value  = 'foo';
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| id   | select_type | table | type   | possible_keys  | key     | key_len | ref               | rows  | filtered | Extra       |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
|    1 | SIMPLE      | d1    | ref    | PRIMARY,cvu,cu | cu      | 1       | const             | 92964 |   100.00 | Using where |
|    1 | SIMPLE      | d2    | eq_ref | PRIMARY,cvu,cu | PRIMARY | 768     | mydb.d1.uri,const |     1 |   100.00 |             |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

MariaDB [mydb]> SHOW WARNINGS;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                                                                              |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | select `mydb`.`d2`.`uri` AS `uri`,`mydb`.`d2`.`value` AS `value` from `mydb`.`data` `d1` join `mydb`.`data` `d2` where ((`mydb`.`d1`.`category` = 1) and (`mydb`.`d2`.`uri` = `mydb`.`d1`.`uri`) and (`mydb`.`d2`.`category` = 2) and (`mydb`.`d1`.`value` = 'foo')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [mydb]> SELECT d2.uri, d2.value FROM data as d1 INNER JOIN data as d2 ON d1.uri  = d2.uri AND d2.category = 2 WHERE d1.category = 1 and d1.value  = 'foo';
+-----------+-------+
| uri       | value |
+-----------+-------+
| uri100000 | 0     |
| uri99981  | 1     |
| uri99982  | 2     |
| uri99983  | 3     |
| uri99984  | 4     |
| uri99985  | 0     |
| uri99986  | 1     |
| uri99987  | 2     |
| uri99988  | 3     |
| uri99989  | 4     |
| uri99990  | 0     |
| uri99991  | 1     |
| uri99992  | 2     |
| uri99993  | 3     |
| uri99994  | 4     |
| uri99995  | 0     |
| uri99996  | 1     |
| uri99997  | 2     |
| uri99998  | 3     |
| uri99999  | 4     |
+-----------+-------+
20 rows in set (0.35 sec)

This query returns 20 rows in ~350ms.

It seems quite slow to me.

Is there a way to improve performance of such query? Any advice?

Can you try the following query?

  SELECT dd.uri, max(case when dd.category=2 then dd.value end) v2
    FROM data as dd
   GROUP by 1 
  having max(case when dd.category=1 then dd.value end)='foo' and v2 is not null;

I cannot repeat your test at the moment, but my hope is that scanning the table just once will compensate for the cost of the aggregate functions.
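The equivalence of the aggregate rewrite and the original self-join can be checked on a small dataset. Here is a non-authoritative sketch using Python's built-in sqlite3 (SQLite stands in for MariaDB, so only the results transfer, not the timings; the miniature dataset has 5 shared uris instead of 20):

```python
import sqlite3

# Sanity check (SQLite stand-in for MariaDB): the single-scan GROUP BY /
# HAVING pivot returns the same rows as the original self-join.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (uri TEXT, category INT, value TEXT, "
            "PRIMARY KEY (uri, category))")
rows = [(f"uri{i}", 1, "foo") for i in range(1, 101)]         # category 1
rows += [(f"uri{i}", 2, str(i % 5)) for i in range(96, 201)]  # category 2
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)    # 5 shared uris

join_q = """SELECT d2.uri, d2.value FROM data d1
            JOIN data d2 ON d1.uri = d2.uri AND d2.category = 2
            WHERE d1.category = 1 AND d1.value = 'foo'"""
pivot_q = """SELECT uri, MAX(CASE WHEN category = 2 THEN value END) AS v2
             FROM data GROUP BY uri
             HAVING MAX(CASE WHEN category = 1 THEN value END) = 'foo'
                AND MAX(CASE WHEN category = 2 THEN value END) IS NOT NULL"""

assert sorted(con.execute(join_q)) == sorted(con.execute(pivot_q))
```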

Edited

Created a test environment and tested some hypotheses. As of today, the best performance (for 1 million rows) has been obtained by:

1 - Adding an index on the uri column

2 - Using the following query

 select d2.uri, d2.value 
   FROM data as d2 
  where exists (select 1 
                  from data d1 
                 where d1.uri  = d2.uri 
                   AND d1.category = 1 
                   and d1.value='foo') 
    and d2.category=2 
    and d2.uri in (select uri from data group by 1 having count(*) > 1);
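The EXISTS-plus-perimeter rewrite can also be checked for equivalence against the plain self-join on a miniature dataset. A sketch with sqlite3 again (SQLite stand-in, 5 shared uris instead of 20):

```python
import sqlite3

# Sketch (SQLite stand-in): the EXISTS + "perimeter" rewrite above returns
# the same rows as the plain self-join.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (uri TEXT, category INT, value TEXT, "
            "PRIMARY KEY (uri, category))")
rows = [(f"uri{i}", 1, "foo") for i in range(1, 101)]
rows += [(f"uri{i}", 2, str(i % 5)) for i in range(96, 201)]
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

exists_q = """SELECT d2.uri, d2.value FROM data d2
              WHERE EXISTS (SELECT 1 FROM data d1
                            WHERE d1.uri = d2.uri
                              AND d1.category = 1 AND d1.value = 'foo')
                AND d2.category = 2
                AND d2.uri IN (SELECT uri FROM data
                               GROUP BY uri HAVING COUNT(*) > 1)"""
join_q = """SELECT d2.uri, d2.value FROM data d1
            JOIN data d2 ON d1.uri = d2.uri AND d2.category = 2
            WHERE d1.category = 1 AND d1.value = 'foo'"""

assert sorted(con.execute(exists_q)) == sorted(con.execute(join_q))
```

The `IN (... HAVING COUNT(*) > 1)` "perimeter" subquery is what prunes uris that appear in only one category before the more expensive correlated EXISTS probe runs.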

The irony is that in my first proposal I tried to minimize table accesses, and now I'm proposing three of them.

Edited: 30/10

Ok, so I've done some other experiments and I would like to summarize the outcomes. First, I'd like to expand a bit on Aruna's answer. What I found interesting in the OP's question is that it is an exception to a classic rule of thumb in database optimization: if the number of desired results is very small compared to the size of the tables involved, it should be possible, with the correct indexes, to get very good performance.

Why can't we simply add a "magic index" to get our 20 rows? Because we don't have any clear "attack vector": there is no selective criterion we can apply to a single record to significantly reduce the number of candidate rows.

Think about it: the fact that the value must be "foo" only removes 50% of the table from the equation. The category is not selective at all either. The only interesting property is that, for 20 uris, records appear with both category 1 and category 2.

But here lies the issue: the condition involves comparing two rows, and unfortunately, to my knowledge, there is no way an index (not even Oracle's function-based indexes) can evaluate a condition that depends on information spread across multiple rows.

The conclusion might be: if this kind of query is what you need, you should revise your data model. For example, if you have a finite and small number of categories (let's say three), your table could be rewritten as:

uri, value_category1, value_category2, value_category3

The query would be:

select uri, value_category2 from data_pivoted where value_category1 = 'foo' and value_category2 is not null;
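As a sketch of this remodelling idea (SQLite stand-in again; the table name `data_pivoted` is made up for the example), the pivoted table can be derived from the original one with the same MAX/CASE aggregates, after which the two-row condition becomes a single-row filter that an ordinary composite index could serve:

```python
import sqlite3

# Sketch of the remodelled layout: one row per uri with one value column
# per category, built here from the original table. The name data_pivoted
# is hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (uri TEXT, category INT, value TEXT, "
            "PRIMARY KEY (uri, category))")
rows = [(f"uri{i}", 1, "foo") for i in range(1, 101)]
rows += [(f"uri{i}", 2, str(i % 5)) for i in range(96, 201)]
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

con.execute("""CREATE TABLE data_pivoted AS
               SELECT uri,
                      MAX(CASE WHEN category = 1 THEN value END) AS value_category1,
                      MAX(CASE WHEN category = 2 THEN value END) AS value_category2
               FROM data GROUP BY uri""")

# The two-row comparison is now a single-row filter.
res = con.execute("""SELECT uri, value_category2 FROM data_pivoted
                     WHERE value_category1 = 'foo'
                       AND value_category2 IS NOT NULL""").fetchall()
```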

By the way, let's go back to the original question. I've created a slightly more efficient test-data generator ( http://pastebin.com/DP8Uaj2t ).

I've used this table:

 use mydb;
 DROP TABLE IF EXISTS data2;

 CREATE TABLE data2 
 ( 
  uri varchar(255) NOT NULL, 
  category tinyint(4) NOT NULL, 
  value varchar(255) NOT NULL, 
  PRIMARY KEY (uri,category), 
  KEY cvu (category,value,uri), 
  KEY ucv (uri,category,value), 
  KEY u (uri), 
  KEY cu (category,uri)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The outcome is:

 +--------------------------+----------+----------+----------+
 | query_descr              | num_rows | num      | num_test |
 +--------------------------+----------+----------+----------+
 | exists_plus_perimeter    |    10000 |   0.0000 |        5 |
 | exists_plus_perimeter    |    50000 |   0.0000 |        5 |
 | exists_plus_perimeter    |   100000 |   0.0000 |        5 |
 | exists_plus_perimeter    |   500000 |   2.0000 |        5 |
 | exists_plus_perimeter    |  1000000 |   4.8000 |        5 |
 | exists_plus_perimeter    |  5000000 |  26.7500 |        8 |
 | max_based                |    10000 |   0.0000 |        5 |
 | max_based                |    50000 |   0.0000 |        5 |
 | max_based                |   100000 |   0.0000 |        5 |
 | max_based                |   500000 |   3.2000 |        5 |
 | max_based                |  1000000 |   7.0000 |        5 |
 | max_based                |  5000000 |  49.5000 |        8 |
 | max_based_with_ucv       |    10000 |   0.0000 |        5 |
 | max_based_with_ucv       |    50000 |   0.0000 |        5 |
 | max_based_with_ucv       |   100000 |   0.0000 |        5 |
 | max_based_with_ucv       |   500000 |   2.6000 |        5 |
 | max_based_with_ucv       |  1000000 |   7.0000 |        5 |
 | max_based_with_ucv       |  5000000 |  36.3750 |        8 |
 | standard_join            |    10000 |   0.0000 |        5 |
 | standard_join            |    50000 |   0.4000 |        5 |
 | standard_join            |   100000 |   2.4000 |        5 |
 | standard_join            |   500000 |  13.4000 |        5 |
 | standard_join            |  1000000 |  33.2000 |        5 |
 | standard_join            |  5000000 | 205.2500 |        8 |
 | standard_join_plus_perim |  5000000 | 155.0000 |        2 |
 +--------------------------+----------+----------+----------+

The queries used are:

- query_exists_plus_perimeter.sql
- query_max_based.sql
- query_max_based_with_ucv.sql
- query_standard_join.sql
- query_standard_join_plus_perim.sql

The best query is still the "query_exists_plus_perimeter" that I posted after creating the first test environment.

It is mainly due to the number of rows analysed. Even though the table is indexed, the main filtering condition "WHERE d1.category = 1 and d1.value = 'foo'" still matches a huge number of rows:

+------+-------------+-------+-.....-+-------+----------+-------------+
| id   | select_type | table |       | rows  | filtered | Extra       |
+------+-------------+-------+-.....-+-------+----------+-------------+
|    1 | SIMPLE      | d1    | ..... | 92964 |   100.00 | Using where |

For each matching row it then has to read the table again, this time for category 2. Since that lookup is on the primary key, it can fetch the matching row directly.

On your original table, check the cardinality of the combination (category, value). If it is close to unique, adding an index on (category, value) should improve performance. If it is like the example given, you may not get any improvement.
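A rough way to check that selectivity, sketched once more with sqlite3 on the miniature dataset (the principle carries over to MariaDB: compare the number of distinct (category, value) pairs with the row count):

```python
import sqlite3

# Rough selectivity check for (category, value): if the number of distinct
# pairs is close to the row count, an index on (category, value) narrows
# the search well; on this data distribution it clearly does not.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (uri TEXT, category INT, value TEXT, "
            "PRIMARY KEY (uri, category))")
rows = [(f"uri{i}", 1, "foo") for i in range(1, 101)]
rows += [(f"uri{i}", 2, str(i % 5)) for i in range(96, 201)]
con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

total = con.execute("SELECT COUNT(*) FROM data").fetchone()[0]
pairs = con.execute("SELECT COUNT(*) FROM "
                    "(SELECT DISTINCT category, value FROM data)").fetchone()[0]
print(total, pairs)  # 205 rows, but only 6 distinct (category, value) pairs
```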
