Poorly performing MySQL subquery - can I turn it into a join?

I have a subquery problem that is causing poor performance. I was thinking that the subquery could be rewritten using a join, but I'm having a hard time wrapping my head around it.

The gist of the query is this: for a given combination of EmailAddress and Product, I need to get a list of the IDs that are NOT the latest. These orders are going to be marked as 'obsolete' in the table, which would leave only the latest order for a given combination of EmailAddress and Product. (Does that make sense?)

Table Definition

CREATE TABLE  `sandbox`.`OrderHistoryTable` (
 `id` INT( 11 ) NOT NULL AUTO_INCREMENT ,
 `EmailAddress` VARCHAR( 100 ) NOT NULL ,
 `Product` VARCHAR( 100 ) NOT NULL ,
 `OrderDate` DATE NOT NULL ,
 `rowlastupdated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
PRIMARY KEY (  `id` ) ,
KEY  `EmailAddress` (  `EmailAddress` ) ,
KEY  `Product` (  `Product` ) ,
KEY  `OrderDate` (  `OrderDate` )
) ENGINE = MYISAM DEFAULT CHARSET = latin1;

Query

SELECT id
FROM
OrderHistoryTable AS EMP1
WHERE
OrderDate not in 
   (
   Select max(OrderDate)
   FROM OrderHistoryTable AS EMP2
   WHERE 
       EMP1.EmailAddress =  EMP2.EmailAddress
   AND EMP1.Product IN ('ProductA','ProductB','ProductC','ProductD')
   AND EMP2.Product IN ('ProductA','ProductB','ProductC','ProductD')
   )

Explanation of duplicate 'IN' statements

id   EmailAddress  Product   OrderDate
13   bob@aol.com   ProductA  2010-10-01
15   bob@aol.com   ProductB  2010-10-02
46   bob@aol.com   ProductD  2010-10-03
57   bob@aol.com   ProductC  2010-10-04
158  bob@aol.com   ProductE  2010-10-05
206  bob@aol.com   ProductB  2010-10-06
501  bob@aol.com   ProductZ  2010-10-07

The results of my query should be ids 13, 15, 46 and 57.

This is because, in the orders listed, those 4 have been 'superseded' by a newer order for a product in the same category. This 'category' contains products A, B, C and D.

Order ids 158 and 501 show no other orders in their respective categories based on the query.

Final query, based on the accepted answer below: I ended up using the following query with no subquery and got about 3X better performance (down to 30 sec from 90 sec). I also now have a separate 'groups' table where I can enumerate the group members instead of spelling them out in the query itself.

SELECT DISTINCT id, EmailAddress FROM (
  SELECT a.id, a.EmailAddress, a.OrderDate
  FROM OrderHistoryTable a
  INNER JOIN OrderHistoryTable b ON a.EmailAddress = b.EmailAddress
  INNER JOIN groups g1  ON  a.Product = g1.Product 
  INNER JOIN groups g2  ON  b.Product = g2.Product 
  WHERE 
        g1.family = 'ProductGroupX'
    AND g2.family = 'ProductGroupX'
  GROUP BY a.id, a.OrderDate, b.OrderDate
  HAVING  a.OrderDate < MAX(b.OrderDate)
) dtX
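
The definition of the 'groups' table is not shown. A minimal sketch that would be consistent with the query above might look like the following; only the Product and family column names come from the query, while the column types, keys and engine are assumptions. As far as I know, GROUPS became a reserved word in MySQL 8.0.2, so on newer servers the table name has to be quoted with backticks (in the query above as well).

```sql
-- Assumed shape of the lookup table: only the Product and family column
-- names are taken from the query above, everything else is a guess.
CREATE TABLE `groups` (
  `Product` VARCHAR(100) NOT NULL,
  `family`  VARCHAR(100) NOT NULL,
  PRIMARY KEY (`family`, `Product`),
  KEY `Product` (`Product`)
) ENGINE = MYISAM DEFAULT CHARSET = latin1;

-- The four products that the original IN (...) list spelled out.
INSERT INTO `groups` (`Product`, `family`) VALUES
  ('ProductA', 'ProductGroupX'),
  ('ProductB', 'ProductGroupX'),
  ('ProductC', 'ProductGroupX'),
  ('ProductD', 'ProductGroupX');
```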

Use:

   SELECT a.id
     FROM OrderHistoryTable AS a
-- latest order date per email address within the category
LEFT JOIN (SELECT e.EmailAddress,
                  MAX(e.OrderDate) AS max_date
             FROM OrderHistoryTable AS e
            WHERE e.Product IN ('ProductA','ProductB','ProductC','ProductD')
         GROUP BY e.EmailAddress) b ON b.EmailAddress = a.EmailAddress
                                   AND b.max_date = a.OrderDate
-- no match means the row is not the latest order
    WHERE b.EmailAddress IS NULL
      AND a.Product IN ('ProductA','ProductB','ProductC','ProductD')

Rant: OMG Ponies' answer gives what you asked for - a rewrite with a join. But I would not be too excited about it: your performance killer is the inside join on email address which, I assume, is not particularly selective, so your database needs to sift through all of those rows looking for the MAX of order date.

In reality, for MySQL this will mean doing a filesort (can you post EXPLAIN SELECT ....?).

Now, if MySQL had access to an index that included emailaddress, product and orderdate, it might, especially on MyISAM, be much more efficient in determining MAX(orderdate) (and no, having an index on each of the columns is not the same as having a composite index on all of the columns). If I were trying to optimize that query, I would bet on that.
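
For illustration, a composite index like the one described could be created roughly as follows. The index name idx_email_product_date is made up, and the EXPLAIN is only a way to check whether the optimizer actually uses the index; whether it helps will depend on your data and MySQL version.

```sql
-- Hypothetical index name; the column order follows the paragraph above.
ALTER TABLE OrderHistoryTable
  ADD INDEX idx_email_product_date (EmailAddress, Product, OrderDate);

-- Inspect the plan for the "latest order date per email address" lookup.
EXPLAIN
SELECT EmailAddress, MAX(OrderDate)
FROM OrderHistoryTable
WHERE Product IN ('ProductA','ProductB','ProductC','ProductD')
GROUP BY EmailAddress;
```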

Other than this rant, here's my version of 'not the latest from a category' (I don't expect it to be better, but it is different and you should test the performance; it just might be faster due to the lack of subqueries).

My attempt (not tested)

SELECT DISTINCT
    notlatest.id, 
    notlatest.emailaddress, 
    notlatest.product, 
    notlatest.orderdate
FROM
    OrderHistoryTable AS notlatest
    LEFT JOIN OrderHistoryTable AS latest ON 
        notlatest.emailaddress = latest.emailaddress AND
        notlatest.orderdate < latest.orderdate
WHERE
    notlatest.product IN ('ProductA','ProductB','ProductC','ProductD') AND
    latest.product IN ('ProductA','ProductB','ProductC','ProductD') AND
    latest.id IS NOT NULL

Comments:
- If there is only one record in the category, it will not be displayed.
- Again, appropriate indexes should speed up the above considerably.

Actually, this is (might be) a good example of how normalizing data would improve performance: your product implies a product category, but the product category is not stored anywhere, and the IN test will not be maintainable in the long run.

Furthermore, by creating a product category you would be able to index directly on it.

If the Product was indexed on the category, then the performance of joins on the category should be better than tests on the Product indexed by value (and not category). (Actually, MyISAM's composite index on emailaddress, category, orderdate should then already contain max, min and count per category, and that should be cheap.)
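
As a rough sketch of that idea, assuming you add a Category column to the order table itself: the column name, its type and the index name are all hypothetical, and 'ProductGroupX' is simply the family name used in the question's final query.

```sql
-- Hypothetical Category column and index name; adjust types to your data.
ALTER TABLE OrderHistoryTable
  ADD COLUMN Category VARCHAR(50) NOT NULL DEFAULT '',
  ADD INDEX idx_email_category_date (EmailAddress, Category, OrderDate);

-- One-off backfill for the category discussed in the question.
UPDATE OrderHistoryTable
SET Category = 'ProductGroupX'
WHERE Product IN ('ProductA','ProductB','ProductC','ProductD');

-- Orders in the category that are older than the latest one per email.
SELECT a.id
FROM OrderHistoryTable AS a
INNER JOIN (SELECT EmailAddress, MAX(OrderDate) AS max_date
            FROM OrderHistoryTable
            WHERE Category = 'ProductGroupX'
            GROUP BY EmailAddress) AS m
        ON m.EmailAddress = a.EmailAddress
WHERE a.Category = 'ProductGroupX'
  AND a.OrderDate < m.max_date;
```

The IN list then lives in one place (the backfill, or whatever maintains Category) instead of being repeated in every query.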

My MySQL is a bit rusty (I'm used to MSSQL), but here's my best guess. It might need a bit of tweaking in the GROUP BY and HAVING clauses. Also, I assumed from your duplicate IN statements that you want the Products to match in both tables. If this isn't the case, I'll adjust the query.

SELECT a.id
FROM OrderHistoryTable a
INNER JOIN OrderHistoryTable b
    ON a.Product = b.Product AND
       a.EmailAddress = b.EmailAddress  -- the table has no Employee column
WHERE a.Product IN ('ProductA','ProductB','ProductC','ProductD')
GROUP BY a.id, a.OrderDate
-- keep rows that are older than the newest matching order
HAVING a.OrderDate < MAX(b.OrderDate)

Edit: removed extraneous AND.

SELECT  *
FROM    (
        SELECT  product, MAX(OrderDate) AS md
        FROM    OrderHistoryTable
        WHERE   product IN ('ProductA','ProductB','ProductC','ProductD')
        GROUP BY
                product
        ) ohti
JOIN    OrderHistoryTable oht
ON      oht.product = ohti.product
        AND oht.orderdate <> ohti.md

Create an index on OrderHistoryTable (product, orderdate) for this to work fast.

Also note that it will return duplicates of the MAX(orderdate) within a product, if any.
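
The index mentioned above could be created with something like this (the index name idx_product_date is just a placeholder):

```sql
-- Hypothetical index name; columns as suggested above.
CREATE INDEX idx_product_date
    ON OrderHistoryTable (Product, OrderDate);
```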
