
Improve self-JOIN SQL query performance

I am trying to improve the performance of a SQL query on MariaDB 10.1.18 (Linux Debian Jessie).

The server has a large amount of RAM (192 GB) and SSD disks.

The real table has hundreds of millions of rows, but I can reproduce my performance issue on a subset of the data with a simplified layout.

Here is the (simplified) table definition:

CREATE TABLE `data` (
  `uri` varchar(255) NOT NULL,
  `category` tinyint(4) NOT NULL,
  `value` varchar(255) NOT NULL,
  PRIMARY KEY (`uri`,`category`),
  KEY `cvu` (`category`,`value`,`uri`),
  KEY `cu` (`category`,`uri`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

To reproduce the actual distribution of my content, I insert about 200'000 rows like this (bash script):

#!/bin/bash
for i in `seq 1 100000`;
do
  mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 1, 'foo');"
done

for i in `seq 99981 200000`;
do
  mysql mydb -e "INSERT INTO data (uri, category, value) VALUES ('uri${i}', 2, '$(($i % 5))');"
done
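The loop above spawns one mysql client process per row, so loading takes a while, but the distribution itself is easy to sanity-check. The sketch below (using Python's `sqlite3` purely as a stand-in engine, not the actual MariaDB setup) rebuilds the same 200'000-row distribution and confirms that exactly 20 uris occur in both categories:

```python
import sqlite3

# In-memory SQLite stand-in for the MariaDB table (assumption: same layout).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    uri      TEXT    NOT NULL,
    category INTEGER NOT NULL,
    value    TEXT    NOT NULL,
    PRIMARY KEY (uri, category))""")

# Category 1: uri1 .. uri100000, static value 'foo'
conn.executemany("INSERT INTO data VALUES (?, 1, 'foo')",
                 ((f"uri{i}",) for i in range(1, 100001)))
# Category 2: uri99981 .. uri200000, value i % 5 (so '0'..'4')
conn.executemany("INSERT INTO data VALUES (?, 2, ?)",
                 ((f"uri{i}", str(i % 5)) for i in range(99981, 200001)))

# uris present in both categories: the overlap uri99981 .. uri100000
overlap = conn.execute("""SELECT COUNT(*) FROM (
    SELECT uri FROM data GROUP BY uri HAVING COUNT(*) > 1)""").fetchone()[0]
print(overlap)  # 20
```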

So, we insert about:

  • 100'000 rows in category 1 with a static string ("foo") as the value
  • 100'000 rows in category 2 with a number between 0 and 4 (i % 5) as the value
  • 20 rows sharing a common "uri" between the two datasets (category 1 / 2)

I always run an ANALYZE TABLE before querying.

Here is the EXPLAIN output of the query I run:

MariaDB [mydb]> EXPLAIN EXTENDED
    -> SELECT d2.uri, d2.value
    -> FROM data as d1
    -> INNER JOIN data as d2 ON d1.uri  = d2.uri AND d2.category = 2
    -> WHERE d1.category = 1 and d1.value  = 'foo';
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
| id   | select_type | table | type   | possible_keys  | key     | key_len | ref               | rows  | filtered | Extra       |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
|    1 | SIMPLE      | d1    | ref    | PRIMARY,cvu,cu | cu      | 1       | const             | 92964 |   100.00 | Using where |
|    1 | SIMPLE      | d2    | eq_ref | PRIMARY,cvu,cu | PRIMARY | 768     | mydb.d1.uri,const |     1 |   100.00 |             |
+------+-------------+-------+--------+----------------+---------+---------+-------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

MariaDB [mydb]> SHOW WARNINGS;
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                                                                              |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | select `mydb`.`d2`.`uri` AS `uri`,`mydb`.`d2`.`value` AS `value` from `mydb`.`data` `d1` join `mydb`.`data` `d2` where ((`mydb`.`d1`.`category` = 1) and (`mydb`.`d2`.`uri` = `mydb`.`d1`.`uri`) and (`mydb`.`d2`.`category` = 2) and (`mydb`.`d1`.`value` = 'foo')) |
+-------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MariaDB [mydb]> SELECT d2.uri, d2.value FROM data as d1 INNER JOIN data as d2 ON d1.uri  = d2.uri AND d2.category = 2 WHERE d1.category = 1 and d1.value  = 'foo';
+-----------+-------+
| uri       | value |
+-----------+-------+
| uri100000 | 0     |
| uri99981  | 1     |
| uri99982  | 2     |
| uri99983  | 3     |
| uri99984  | 4     |
| uri99985  | 0     |
| uri99986  | 1     |
| uri99987  | 2     |
| uri99988  | 3     |
| uri99989  | 4     |
| uri99990  | 0     |
| uri99991  | 1     |
| uri99992  | 2     |
| uri99993  | 3     |
| uri99994  | 4     |
| uri99995  | 0     |
| uri99996  | 1     |
| uri99997  | 2     |
| uri99998  | 3     |
| uri99999  | 4     |
+-----------+-------+
20 rows in set (0.35 sec)

This query returns 20 rows in ~350 ms.

That seems quite slow to me.

Is there a way to improve the performance of such a query? Any advice?

Can you try the following query?

  SELECT dd.uri, max(case when dd.category=2 then dd.value end) v2
    FROM data as dd
   GROUP by 1 
  having max(case when dd.category=1 then dd.value end)='foo' and v2 is not null;
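As a quick correctness check (not a benchmark), the aggregate rewrite can be compared against the original join on a scaled-down copy of the data. The sketch below uses Python's `sqlite3` as the engine; the `v2` alias from the HAVING clause is expanded inline for portability:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    uri TEXT NOT NULL, category INTEGER NOT NULL, value TEXT NOT NULL,
    PRIMARY KEY (uri, category))""")
# Scaled-down distribution: ~1'000 rows per category, 20 overlapping uris.
conn.executemany("INSERT INTO data VALUES (?, 1, 'foo')",
                 ((f"uri{i}",) for i in range(1, 1001)))
conn.executemany("INSERT INTO data VALUES (?, 2, ?)",
                 ((f"uri{i}", str(i % 5)) for i in range(981, 2001)))

# Original self-join form
join_rows = conn.execute("""
    SELECT d2.uri, d2.value
    FROM data d1
    JOIN data d2 ON d1.uri = d2.uri AND d2.category = 2
    WHERE d1.category = 1 AND d1.value = 'foo'""").fetchall()

# Single-pass aggregate form (alias expanded in HAVING for portability)
agg_rows = conn.execute("""
    SELECT dd.uri, MAX(CASE WHEN dd.category = 2 THEN dd.value END) v2
    FROM data dd
    GROUP BY dd.uri
    HAVING MAX(CASE WHEN dd.category = 1 THEN dd.value END) = 'foo'
       AND MAX(CASE WHEN dd.category = 2 THEN dd.value END) IS NOT NULL""").fetchall()

print(sorted(join_rows) == sorted(agg_rows))  # True
print(len(join_rows))                         # 20
```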

I cannot repeat your test at the moment, but my hope is that having to scan the table just once could offset the cost of the aggregate functions.

Edited

Created a test environment and tested some hypotheses. As of today, the best performance (for 1 million rows) has been obtained by:

1 - Adding an index on the uri column

2 - Using the following query

 select d2.uri, d2.value 
   FROM data as d2 
  where exists (select 1 
                  from data d1 
                 where d1.uri  = d2.uri 
                   AND d1.category = 1 
                   and d1.value='foo') 
    and d2.category=2 
    and d2.uri in (select uri from data group by 1 having count(*) > 1);
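Again as a correctness check, this EXISTS-plus-perimeter form can be verified against the plain join on a scaled-down copy of the data (a `sqlite3` sketch; `GROUP BY 1` is spelled out as `GROUP BY uri` for portability):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    uri TEXT NOT NULL, category INTEGER NOT NULL, value TEXT NOT NULL,
    PRIMARY KEY (uri, category))""")
conn.executemany("INSERT INTO data VALUES (?, 1, 'foo')",
                 ((f"uri{i}",) for i in range(1, 1001)))
conn.executemany("INSERT INTO data VALUES (?, 2, ?)",
                 ((f"uri{i}", str(i % 5)) for i in range(981, 2001)))
conn.execute("CREATE INDEX u ON data (uri)")  # the extra index on uri

# EXISTS plus the "perimeter" of uris that appear more than once
exists_rows = conn.execute("""
    SELECT d2.uri, d2.value
    FROM data d2
    WHERE EXISTS (SELECT 1 FROM data d1
                  WHERE d1.uri = d2.uri
                    AND d1.category = 1
                    AND d1.value = 'foo')
      AND d2.category = 2
      AND d2.uri IN (SELECT uri FROM data
                     GROUP BY uri HAVING COUNT(*) > 1)""").fetchall()

join_rows = conn.execute("""
    SELECT d2.uri, d2.value FROM data d1
    JOIN data d2 ON d1.uri = d2.uri AND d2.category = 2
    WHERE d1.category = 1 AND d1.value = 'foo'""").fetchall()

print(sorted(exists_rows) == sorted(join_rows))  # True
```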

The ironic thing is that in the first proposal I tried to minimize the accesses to the table, and now I'm proposing three accesses.

Edited: 30/10

Ok, so I've done some other experiments and I would like to summarize the outcomes. First, I'd like to expand a bit on Aruna's answer: what I found interesting in the OP's question is that it is an exception to a classic "rule of thumb" in database optimization: if the number of desired results is very small compared to the size of the tables involved, it should be possible, with the correct indexes, to get very good performance.

Why can't we simply add a "magic index" to get our 20 rows? Because we don't have any clear "attack vector": there is no clearly selective criterion we can apply to a record to significantly reduce the number of target rows.

Think about it: the fact that the value must be "foo" only removes 50% of the table from the equation. The category is not selective at all either: the only interesting thing is that, for 20 uris, they appear both in records with category 1 and category 2.

But here lies the issue: the condition involves comparing two rows, and unfortunately, to my knowledge, there's no way an index (not even an Oracle function-based index) can evaluate a condition that depends on information from multiple rows.

The conclusion might be: if this kind of query is what you need, you should revise your data model. For example, if you have a finite and small number of categories (let's say three), your table might be written as:

uri, value_category1, value_category2, value_category3

The query would be:

select uri, value_category2 from data where value_category1 = 'foo' and value_category2 is not null;
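A sketch of this pivoted model in Python's `sqlite3` (the table and column names are illustrative, matching the layout suggested above); with one row per uri, the 20 matches come straight off a single-table filter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data_wide (
    uri TEXT PRIMARY KEY,
    value_category1 TEXT,
    value_category2 TEXT,
    value_category3 TEXT)""")

# Scaled-down distribution pivoted to one row per uri:
# category 1 covers uri1..uri1000 ('foo'), category 2 covers uri981..uri2000 (i % 5).
rows = []
for i in range(1, 2001):
    v1 = "foo" if i <= 1000 else None
    v2 = str(i % 5) if i >= 981 else None
    rows.append((f"uri{i}", v1, v2, None))
conn.executemany("INSERT INTO data_wide VALUES (?, ?, ?, ?)", rows)

hits = conn.execute("""
    SELECT uri, value_category2
    FROM data_wide
    WHERE value_category1 = 'foo' AND value_category2 IS NOT NULL""").fetchall()
print(len(hits))  # 20
```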

By the way, let's go back to the original question. I've created a slightly more efficient test data generator ( http://pastebin.com/DP8Uaj2t ).

I've used this table:

 use mydb;
 DROP TABLE IF EXISTS data2;

 CREATE TABLE data2
 (
   uri varchar(255) NOT NULL,
   category tinyint(4) NOT NULL,
   value varchar(255) NOT NULL,
   PRIMARY KEY (uri,category),
   KEY cvu (category,value,uri),
   KEY ucv (uri,category,value),
   KEY u (uri),
   KEY cu (category,uri)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The outcome is:

 +--------------------------+----------+----------+----------+
 | query_descr              | num_rows | num      | num_test |
 +--------------------------+----------+----------+----------+
 | exists_plus_perimeter    |    10000 |   0.0000 |        5 |
 | exists_plus_perimeter    |    50000 |   0.0000 |        5 |
 | exists_plus_perimeter    |   100000 |   0.0000 |        5 |
 | exists_plus_perimeter    |   500000 |   2.0000 |        5 |
 | exists_plus_perimeter    |  1000000 |   4.8000 |        5 |
 | exists_plus_perimeter    |  5000000 |  26.7500 |        8 |
 | max_based                |    10000 |   0.0000 |        5 |
 | max_based                |    50000 |   0.0000 |        5 |
 | max_based                |   100000 |   0.0000 |        5 |
 | max_based                |   500000 |   3.2000 |        5 |
 | max_based                |  1000000 |   7.0000 |        5 |
 | max_based                |  5000000 |  49.5000 |        8 |
 | max_based_with_ucv       |    10000 |   0.0000 |        5 |
 | max_based_with_ucv       |    50000 |   0.0000 |        5 |
 | max_based_with_ucv       |   100000 |   0.0000 |        5 |
 | max_based_with_ucv       |   500000 |   2.6000 |        5 |
 | max_based_with_ucv       |  1000000 |   7.0000 |        5 |
 | max_based_with_ucv       |  5000000 |  36.3750 |        8 |
 | standard_join            |    10000 |   0.0000 |        5 |
 | standard_join            |    50000 |   0.4000 |        5 |
 | standard_join            |   100000 |   2.4000 |        5 |
 | standard_join            |   500000 |  13.4000 |        5 |
 | standard_join            |  1000000 |  33.2000 |        5 |
 | standard_join            |  5000000 | 205.2500 |        8 |
 | standard_join_plus_perim |  5000000 | 155.0000 |        2 |
 +--------------------------+----------+----------+----------+

The queries used are:

- query_exists_plus_perimeter.sql
- query_max_based.sql
- query_max_based_with_ucv.sql
- query_standard_join.sql
- query_standard_join_plus_perim.sql

The best query is still the "query_exists_plus_perimeter" that I posted after creating the first test environment.

It is mainly due to the number of rows analysed. Even though the tables are indexed, the main decision-making condition "WHERE d1.category = 1 and d1.value = 'foo'" still matches a huge number of rows:

+------+-------------+-------+-.....-+-------+----------+-------------+
| id   | select_type | table |       | rows  | filtered | Extra       |
+------+-------------+-------+-.....-+-------+----------+-------------+
|    1 | SIMPLE      | d1    | ..... | 92964 |   100.00 | Using where |

For each and every matching row it has to read the table again, this time for category 2. Since it reads on the primary key, it can fetch the matching row directly.

On your original table, check the cardinality of the combination of category and value. If it is closer to unique, you can add an index on (category, value), and that should improve the performance. If it is like the example given, you may not get any performance improvement.
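That cardinality check is a plain GROUP BY on the real table. The sketch below runs it with Python's `sqlite3` on the scaled-down sample distribution: if one (category, value) pair matches a large fraction of the rows, as 'foo' does here, an index on (category, value) cannot narrow the scan much.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data (
    uri TEXT NOT NULL, category INTEGER NOT NULL, value TEXT NOT NULL,
    PRIMARY KEY (uri, category))""")
conn.executemany("INSERT INTO data VALUES (?, 1, 'foo')",
                 ((f"uri{i}",) for i in range(1, 1001)))
conn.executemany("INSERT INTO data VALUES (?, 2, ?)",
                 ((f"uri{i}", str(i % 5)) for i in range(981, 2001)))

# Rows matched per (category, value) pair, most frequent first.
stats = conn.execute("""
    SELECT category, value, COUNT(*) AS n
    FROM data
    GROUP BY category, value
    ORDER BY n DESC""").fetchall()
print(stats[0])  # (1, 'foo', 1000): the filter matches every category-1 row
```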
