Poorly performing MySQL subquery: can I turn it into a join?

Question

I have a subquery that is causing poor performance. I was thinking that it could be rewritten using a join, but I'm having a hard time wrapping my head around it.

The gist of the query is this: for a given combination of EmailAddress and Product, I need to get a list of the ids that are NOT the latest. These orders are going to be marked as 'obsolete' in the table, which would leave only the latest order for a given combination of EmailAddress and Product. (Does that make sense?)

Table Definition

CREATE TABLE  `sandbox`.`OrderHistoryTable` (
 `id` INT( 11 ) NOT NULL AUTO_INCREMENT ,
 `EmailAddress` VARCHAR( 100 ) NOT NULL ,
 `Product` VARCHAR( 100 ) NOT NULL ,
 `OrderDate` DATE NOT NULL ,
 `rowlastupdated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
PRIMARY KEY (  `id` ) ,
KEY  `EmailAddress` (  `EmailAddress` ) ,
KEY  `Product` (  `Product` ) ,
KEY  `OrderDate` (  `OrderDate` )
) ENGINE = MYISAM DEFAULT CHARSET = latin1;

Query

SELECT id
FROM
OrderHistoryTable AS EMP1
WHERE
OrderDate not in 
   (
   Select max(OrderDate)
   FROM OrderHistoryTable AS EMP2
   WHERE 
       EMP1.EmailAddress =  EMP2.EmailAddress
   AND EMP1.Product IN ('ProductA','ProductB','ProductC','ProductD')
   AND EMP2.Product IN ('ProductA','ProductB','ProductC','ProductD')
   )

Explanation of duplicate 'IN' statements

13   bob@aol.com  ProductA  2010-10-01
15   bob@aol.com  ProductB  2010-20-02
46   bob@aol.com  ProductD  2010-20-03
57   bob@aol.com  ProductC  2010-20-04
158  bob@aol.com  ProductE  2010-20-05
206  bob@aol.com  ProductB  2010-20-06
501  bob@aol.com  ProductZ  2010-20-07

The results of my query should be the ids 13, 15, 46 and 57.

This is because, in the orders listed, those 4 have been 'superseded' by a newer order for a product in the same category. This 'category' contains products A, B, C and D.

Order ids 158 and 501 show no other orders in their respective categories based on the query.
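To try the query against the rows listed above, they could be loaded roughly as follows. This is only a sketch: the dates are illustrative stand-ins that preserve the order of the listing, not the original values, and rowlastupdated is left to its default.

```sql
-- Illustrative sample rows; the dates are assumptions that only keep the listed order.
INSERT INTO OrderHistoryTable (id, EmailAddress, Product, OrderDate) VALUES
  (13,  'bob@aol.com', 'ProductA', '2010-10-01'),
  (15,  'bob@aol.com', 'ProductB', '2010-10-02'),
  (46,  'bob@aol.com', 'ProductD', '2010-10-03'),
  (57,  'bob@aol.com', 'ProductC', '2010-10-04'),
  (158, 'bob@aol.com', 'ProductE', '2010-10-05'),
  (206, 'bob@aol.com', 'ProductB', '2010-10-06'),
  (501, 'bob@aol.com', 'ProductZ', '2010-10-07');

-- Within the category {ProductA .. ProductD}, id 206 is the latest order,
-- so the older category rows are the ones that should come back.
```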

Final Query based off of accepted answer below: I ended up using the following query with no subquery and got about 3X performance (30 sec down from 90 sec). I also now have a separate 'groups' table where I can enumerate the group members instead of spelling them out in the query itself...

SELECT DISTINCT id, EmailAddress FROM (
  SELECT a.id, a.EmailAddress, a.OrderDate
  FROM OrderHistoryTable a
  INNER JOIN OrderHistoryTable b ON a.EmailAddress = b.EmailAddress
  INNER JOIN groups g1  ON  a.Product = g1.Product 
  INNER JOIN groups g2  ON  b.Product = g2.Product 
  WHERE 
        g1.family = 'ProductGroupX'
    AND g2.family = 'ProductGroupX'
  GROUP BY a.id, a.OrderDate, b.OrderDate
  HAVING  a.OrderDate < MAX(b.OrderDate)
) dtX
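The final query relies on a lookup table whose definition is not shown. A minimal sketch of what it might look like: only the table name and the column names Product and family come from the query; the types, keys and sample rows are assumptions.

```sql
-- Hypothetical definition of the lookup table used by the final query.
-- Assumes each product belongs to exactly one family (hence the primary key).
CREATE TABLE `groups` (
  `Product` VARCHAR(100) NOT NULL,
  `family`  VARCHAR(100) NOT NULL,
  PRIMARY KEY (`Product`),
  KEY `family` (`family`)
) ENGINE = MYISAM DEFAULT CHARSET = latin1;

-- Illustrative rows: the four products from the question form one family.
INSERT INTO `groups` (`Product`, `family`) VALUES
  ('ProductA', 'ProductGroupX'),
  ('ProductB', 'ProductGroupX'),
  ('ProductC', 'ProductGroupX'),
  ('ProductD', 'ProductGroupX');
```

Note that GROUPS is a reserved word in MySQL 8.0, so on newer servers the table name would need backticks wherever it appears, including in the query above.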

Answer 1

Use:

   -- b holds the latest order date per email address within the category
   SELECT a.id
     FROM OrderHistoryTable AS a
LEFT JOIN (SELECT e.EmailAddress,
                  MAX(e.OrderDate) AS max_date
             FROM OrderHistoryTable AS e
            WHERE e.Product IN ('ProductA','ProductB','ProductC','ProductD')
         GROUP BY e.EmailAddress) b ON b.EmailAddress = a.EmailAddress
                                   AND b.max_date = a.OrderDate
    -- no match in b means this row is not the latest order for that address
    WHERE b.EmailAddress IS NULL
      AND a.Product IN ('ProductA','ProductB','ProductC','ProductD')

Answer 2

Rant: OMG Ponies' answer gives what you asked for, a rewrite with a join. But I would not be too excited about it. Your performance killer is the join on email address, which I assume is not particularly selective, so your database needs to sift through all of those rows looking for the MAX of order date.

In practice, for MySQL this will mean doing a filesort (can you post EXPLAIN SELECT ....?).
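For reference, getting the plan is just a matter of prefixing the statement with EXPLAIN. A sketch using the query from the question:

```sql
-- Ask MySQL for the execution plan instead of running the query.
EXPLAIN
SELECT id
FROM OrderHistoryTable AS EMP1
WHERE OrderDate NOT IN (
    SELECT MAX(OrderDate)
    FROM OrderHistoryTable AS EMP2
    WHERE EMP1.EmailAddress = EMP2.EmailAddress
      AND EMP1.Product IN ('ProductA','ProductB','ProductC','ProductD')
      AND EMP2.Product IN ('ProductA','ProductB','ProductC','ProductD')
);
```

If the inner query shows up with select_type DEPENDENT SUBQUERY, it is being re-evaluated for each outer row. "Using filesort" or "Using temporary" in the Extra column would point at the sorting cost mentioned above.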

Now, if mysql had access to an index that would include emailaddress , product and orderdate it might, especially on MyISAM be much more efficient in determining MAX(orderdate) (and no, having an index on each of the columns is not the same as having a composite index on all of the columns). If I was trying to optimize that query, I would bet on that.
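A composite index along those lines might look like this. The index name is made up, and whether the optimizer actually uses it for this query is something to confirm with EXPLAIN rather than take for granted.

```sql
-- One index covering all three columns, as opposed to the three
-- single-column indexes in the table definition.
-- idx_email_product_date is a hypothetical name.
ALTER TABLE OrderHistoryTable
  ADD INDEX idx_email_product_date (EmailAddress, Product, OrderDate);

-- List the indexes to check that it was created.
SHOW INDEX FROM OrderHistoryTable;
```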

Other than this rant here's my version of not the latest from a category (I don't expect it to be better, but it is different and you should test the performance; it just might be faster due to lack of subqueries)

My attempt (not tested)

SELECT DISTINCT
    notlatest.id, 
    notlatest.emailaddress, 
    notlatest.product, 
    notlatest.orderdate
FROM
    OrderHistoryTable AS notlatest
    LEFT JOIN OrderHistoryTable AS latest ON 
        notlatest.emailaddress = latest.emailaddress AND
        notlatest.orderdate < latest.orderdate
WHERE
    notlatest.product IN ('ProductA','ProductB','ProductC','ProductD') AND
    latest.product IN ('ProductA','ProductB','ProductC','ProductD') AND
    -- keep only rows that have a newer order in the category
    latest.id IS NOT NULL

Comments:
- If there is only one record in the category it will not be displayed
- Again indexes should speed the above very much

Actually this is (might be) a good example of how normalizing data would improve performance - your product implies product category, but product category is not stored anywhere and the IN test will not be maintainable in the long run.

Furthermore by creating a product category you would be able to index directly on it.

If the Product was indexed on the category, then the performance of joins on the category should be better than a test on the Product indexed by value (and not category). (Actually, MyISAM's composite index on emailaddress, category, orderdate should then make max, min and count per category cheap to obtain.)
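As a rough sketch of that idea, the category could be stored on the order rows and indexed together with the email address and date. Everything named here (the Category column, the 'ProductGroupX' value, the index name) is an assumption for illustration, and the query is untested against real data.

```sql
-- Hypothetical Category column; NULL for products without a category.
ALTER TABLE OrderHistoryTable
  ADD COLUMN Category VARCHAR(50) NULL;

-- Backfill the one category described in the question.
UPDATE OrderHistoryTable
   SET Category = 'ProductGroupX'
 WHERE Product IN ('ProductA','ProductB','ProductC','ProductD');

-- Composite index so MAX(OrderDate) per (EmailAddress, Category) can come from the index.
ALTER TABLE OrderHistoryTable
  ADD INDEX idx_email_category_date (EmailAddress, Category, OrderDate);

-- Not-the-latest orders per email address and category.
-- Rows with a NULL Category never match the join and are left out.
SELECT a.id
FROM OrderHistoryTable AS a
INNER JOIN (SELECT EmailAddress, Category, MAX(OrderDate) AS max_date
            FROM OrderHistoryTable
            GROUP BY EmailAddress, Category) AS m
        ON m.EmailAddress = a.EmailAddress
       AND m.Category = a.Category
WHERE a.OrderDate < m.max_date;
```

This also removes the hard-coded IN lists from the query, which is the maintainability point made above.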

Answer 3

My MySQL is a bit rusty (I'm used to MSSQL), but here's my best guess. It might need a bit of tweaking in the GROUP BY and HAVING clauses. Also, I assumed from your duplicate IN statements that you want the Products to match in both tables. If this isn't the case, I'll adjust the query.

SELECT a.id
FROM OrderHistoryTable a
INNER JOIN OrderHistoryTable b
    ON a.Product = b.Product AND
       a.EmailAddress = b.EmailAddress
WHERE a.Product IN ('ProductA','ProductB','ProductC','ProductD')
GROUP BY a.id, a.OrderDate
-- keep a row only if a newer order exists for the same address and product
HAVING a.OrderDate < MAX(b.OrderDate)

Edit: removed extraneous AND.

Answer 4

SELECT  *
FROM    (
        SELECT  product, MAX(OrderDate) AS md
        FROM    OrderHistoryTable
        WHERE   product IN ('ProductA','ProductB','ProductC','ProductD')
        GROUP BY
                product
        ) ohti
JOIN    OrderHistoryTable oht
ON      oht.product = ohti.product
        AND oht.orderdate <> ohti.md

Create an index on OrderHistoryTable (product, orderdate) for this to work fast.
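That index could be created like this (the index name is a placeholder):

```sql
-- Composite index on (Product, OrderDate); idx_product_date is a hypothetical name.
CREATE INDEX idx_product_date
    ON OrderHistoryTable (Product, OrderDate);
```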

Also note how ties are handled: if several rows within a product share the MAX(orderdate), none of them is returned, since they all count as the latest.

4 answers

solution1
5 2010-10-28 15:54:07

solution2
2 ACCEPTED 2010-11-01 15:24:18

solution3
1 2010-10-28 19:18:47

solution4
0 2010-11-01 13:50:12
