How to optimize a SELECT query with CASE statements?

I have 3 tables with 1,000,000+ records. My SELECT query runs for hours. How can I optimize it? I'm a newbie.

I tried adding an index on name, but it still takes hours to load.

Like this:

ALTER TABLE table2 ADD INDEX(name);

and also like this:

CREATE INDEX INDEX1 ON table2(name);

SELECT MS.*, P.Counts FROM 
(SELECT M.*, 
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,               
CASE V.name 
WHEN 'text' THEN  M.name 
WHEN V.name IS NULL THEN M.name 
ELSE V.name 
END col1  
FROM table1 M 
LEFT JOIN table2 V ON M.id=V.id) AS MS
LEFT JOIN 
(select E.id, count(E.id) Counts 
from table3 E
where E.field2 = 'value1' 
group by E.id) AS P
ON MS.id=P.id;

Explain <above query>; 

Output:

+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
| id | select_type | table      | partitions | type  | possible_keys                               | key              | key_len | ref                    | rows    | filtered | Extra                                                           |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
|  1 | PRIMARY     | M          | NULL       | ALL   | NULL                                        | NULL             | NULL    | NULL                   |  344763 |   100.00 | NULL                                                            |
|  1 | PRIMARY     | <derived3> | NULL       | ref   | <auto_key0>                                 | <auto_key0>      | 8       | CP.M.id                |      10 |   100.00 | NULL                                                            |
|  1 | PRIMARY     | V          | NULL       | index | NULL                                        | INDEX1           | 411     | NULL                   | 1411083 |   100.00 | Using where; Using index; Using join buffer (Block Nested Loop) |
|  3 | DERIVED     | E          | NULL       | ref   | PRIMARY,f2,f3                               | f2               | 43      | const                  |  966442 |   100.00 | Using index                                                     |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+

I expect to get the result in less than 1 minute.

The query, indented for clarity:

SELECT MS.*, P.Counts
  FROM  (
           SELECT M.*, 
                  TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,               
             CASE V.name 
                  WHEN 'text' THEN  M.name 
                  WHEN V.name IS NULL THEN M.name 
                  ELSE V.name 
                  END col1  
             FROM table1 M 
             LEFT JOIN table2 V ON M.id=V.id
      ) AS MS
  LEFT JOIN ( 
                  select E.id, count(E.id) Counts 
                   from table3 E
                   where E.field2 = 'value1' 
                   group by E.id
    ) AS P ON MS.id=P.id;

Your query has no filtering predicate, so it's essentially retrieving all the rows. That is 1,000,000+ rows from table1. It then joins them with table2, and then with another table expression/derived table.

Why do you expect this query to be fast? A massive query like this one would normally run as a batch process at night. I assume this query is not for an online process, right?

Maybe you need to rethink the process. Do you really need to process millions of rows at once interactively? Will the user read a million rows in a web page?
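
If the rows are ultimately shown to a user, one option is to fetch them a page at a time instead of all at once. The following is a minimal sketch of keyset pagination, assuming table1.id is a unique, indexed key; the page size of 100 and the starting value 0 are placeholders, not values from the question.

```sql
-- Hypothetical sketch: return one page of rows from table1, ordered by id.
-- Assumes table1.id is unique and indexed (e.g. the primary key).
SELECT M.*,
       TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age
FROM table1 AS M
WHERE M.id > 0          -- placeholder: the last id seen on the previous page
ORDER BY M.id
LIMIT 100;              -- placeholder page size
```

Because the range condition and LIMIT restrict the scan to a small slice of the index on id, each request reads roughly one page of table1 rather than the whole table.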

For starters, you are returning the same result for col1 whether V.name is NULL or V.name = 'text'. That said, you can move that extra condition into your join with table2 and use the IFNULL function.
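
One detail in the original CASE is worth checking. In the simple form (CASE V.name WHEN ...), every WHEN operand is compared with V.name for equality, so WHEN V.name IS NULL compares V.name against the result of the IS NULL test rather than testing for NULL. Below is a sketch of the presumably intended logic written as a searched CASE; the intent is an assumption based on the question, and col1_alt is a made-up alias for illustration.

```sql
SELECT M.id,
       -- Searched CASE: each WHEN is an independent boolean condition.
       CASE
           WHEN V.name IS NULL  THEN M.name
           WHEN V.name = 'text' THEN M.name
           ELSE V.name
       END AS col1,
       -- Equivalent compact form: NULLIF turns 'text' into NULL,
       -- then COALESCE falls back to M.name when the value is NULL.
       COALESCE(NULLIF(V.name, 'text'), M.name) AS col1_alt
FROM table1 AS M
LEFT JOIN table2 AS V ON M.id = V.id;
```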

As you are filtering table3 by field2, you could probably create an index on table3 that includes field2.
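
A sketch of such an index, assuming the derived table only needs field2 and id from table3. The index name idx_table3_field2_id is made up for illustration. The EXPLAIN output in the question already shows an index named f2 being used with "Using index", so a suitable index may already exist; SHOW INDEX lists what is there.

```sql
-- List the existing indexes first; f2 may already cover this lookup.
SHOW INDEX FROM table3;

-- Hypothetical composite index: field2 serves the WHERE filter, id the GROUP BY.
CREATE INDEX idx_table3_field2_id ON table3 (field2, id);
```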

You should also check whether you can add any additional filters on any of those tables; if you can, consider using a stored procedure to get the results.

Also, I don't see why you need to wrap the first join in the derived table 'MS'; you can easily do all the joins in one go, like this:

SELECT 
    M.*, 
    TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,               
    IFNULL(V.name, M.name) as col1,
    P.Counts 
FROM table1 M 
LEFT JOIN table2 V ON M.id=V.id AND V.name <> 'text'
LEFT JOIN 
(SELECT 
    E.id, 
    COUNT(E.id) Counts 
FROM table3 E
WHERE E.field2 = 'value1' 
GROUP BY E.id) AS P ON M.id=P.id;

I'm also assuming that you have clustered indexes on the id fields in all three tables, but with no filter, if you are dealing with millions of records, this will always be a big, heavy query. At the very least, you are doing a table scan of table1.

I've added this additional information after your comment.

I mentioned clustered indexes. According to the official documentation on indexes:

When you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index. So if you already have a primary key defined, you don't need to do anything else. As the documentation also points out, you should define a primary key for each table that you create.
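
To check whether a table already has a primary key, something like the following should work in MySQL (shown for table1; the same applies to the other tables):

```sql
-- Returns one row per primary-key column; an empty result means no primary key.
SHOW INDEX FROM table1 WHERE Key_name = 'PRIMARY';

-- Alternatively, inspect the full table definition, including all keys.
SHOW CREATE TABLE table1;
```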

If you don't have a primary key, here is the code snippet you requested.

-- MySQL has no CLUSTERED keyword; InnoDB clusters on the primary key automatically.
ALTER TABLE table1 ADD CONSTRAINT pk_table1
 PRIMARY KEY (id);

ATTENTION: Keep in mind that creating a clustered index is a big operation for tables like yours with tons of data. This isn't something you want to do on a production server without planning. The operation will also take a long time, and the table may be locked during the process.
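
On InnoDB, it may be possible to reduce the blocking by requesting an online operation explicitly. This is a sketch that assumes a MySQL version with online DDL support (5.6 or later) and that id is already NOT NULL and unique; if the server cannot honor the requested ALGORITHM or LOCK level, the statement fails with an error instead of silently taking a stronger lock.

```sql
-- Ask for an in-place rebuild that still allows concurrent reads and writes.
-- The statement errors out if these options are not supported for this change.
ALTER TABLE table1
    ADD PRIMARY KEY (id),
    ALGORITHM = INPLACE,
    LOCK = NONE;
```

The table is still rebuilt, so the operation takes time and extra disk space even when it does not block other sessions.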

Subqueries are not always well-optimized.

I think you can flatten it out, something like:

SELECT  M.*, V.*,
        TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
        CASE V.name WHEN 'text'          THEN M.name
                    WHEN V.name IS NULL  THEN M.name
                                         ELSE V.name  END col1,
        -- Correlated subquery: count the matching table3 rows for this M.id
        ( SELECT COUNT(*) FROM table3 WHERE field2 = 'value1' AND id = M.id
        ) AS Counts
    FROM table1 AS M
    LEFT JOIN table2 AS V  ON M.id = V.id;

I may have some parts not quite right; see if you can make this formulation work.
