I have a subquery problem that is causing poor performance... I was thinking that the subquery could be re-written using a join, but I'm having a hard time wrapping my head around it.
The gist of the query is this: for a given combination of EmailAddress and Product, I need to get a list of the IDs that are NOT the latest. These orders are going to be marked as 'obsolete' in the table, which would leave only the latest order for a given combination of EmailAddress and Product (does that make sense?).
Table Definition
CREATE TABLE `sandbox`.`OrderHistoryTable` (
`id` INT( 11 ) NOT NULL AUTO_INCREMENT ,
`EmailAddress` VARCHAR( 100 ) NOT NULL ,
`Product` VARCHAR( 100 ) NOT NULL ,
`OrderDate` DATE NOT NULL ,
`rowlastupdated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ,
PRIMARY KEY ( `id` ) ,
KEY `EmailAddress` ( `EmailAddress` ) ,
KEY `Product` ( `Product` ) ,
KEY `OrderDate` ( `OrderDate` )
) ENGINE = MYISAM DEFAULT CHARSET = latin1;
Query
SELECT id
FROM
OrderHistoryTable AS EMP1
WHERE
OrderDate not in
(
Select max(OrderDate)
FROM OrderHistoryTable AS EMP2
WHERE
EMP1.EmailAddress = EMP2.EmailAddress
AND EMP1.Product IN ('ProductA','ProductB','ProductC','ProductD')
AND EMP2.Product IN ('ProductA','ProductB','ProductC','ProductD')
)
Explanation of duplicate 'IN' statements
13 bob@aol.com ProductA 2010-10-01
15 bob@aol.com ProductB 2010-10-02
46 bob@aol.com ProductD 2010-10-03
57 bob@aol.com ProductC 2010-10-04
158 bob@aol.com ProductE 2010-10-05
206 bob@aol.com ProductB 2010-10-06
501 bob@aol.com ProductZ 2010-10-07
The results of my query should be the ids 13, 15, 46 and 57.
This is because, in the orders listed, those 4 have been 'superseded' by a newer order for a product in the same category. This 'category' contains products A, B, C & D.
Order ids 158 and 501 show no other orders in their respective categories based on the query.
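To make the expected result concrete, here is a small sketch that loads the sample rows into an in-memory SQLite database (an assumption for portability; the question targets MySQL) and runs the query as posted. The dates are assumed to run from 2010-10-01 to 2010-10-07, and the helper name obsolete_ids is made up for illustration.

```python
import sqlite3

# Sample rows from the question. The dates are an assumption:
# consecutive days 2010-10-01 .. 2010-10-07, stored as ISO text.
ROWS = [
    (13, "bob@aol.com", "ProductA", "2010-10-01"),
    (15, "bob@aol.com", "ProductB", "2010-10-02"),
    (46, "bob@aol.com", "ProductD", "2010-10-03"),
    (57, "bob@aol.com", "ProductC", "2010-10-04"),
    (158, "bob@aol.com", "ProductE", "2010-10-05"),
    (206, "bob@aol.com", "ProductB", "2010-10-06"),
    (501, "bob@aol.com", "ProductZ", "2010-10-07"),
]

# The correlated-subquery version from the question, unchanged in logic.
ORIGINAL_QUERY = """
SELECT id
FROM OrderHistoryTable AS EMP1
WHERE OrderDate NOT IN (
    SELECT MAX(OrderDate)
    FROM OrderHistoryTable AS EMP2
    WHERE EMP1.EmailAddress = EMP2.EmailAddress
      AND EMP1.Product IN ('ProductA','ProductB','ProductC','ProductD')
      AND EMP2.Product IN ('ProductA','ProductB','ProductC','ProductD')
)
"""


def obsolete_ids():
    """Return the sorted ids the original query flags as not the latest."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OrderHistoryTable ("
        "id INTEGER PRIMARY KEY, EmailAddress TEXT NOT NULL, "
        "Product TEXT NOT NULL, OrderDate TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO OrderHistoryTable VALUES (?, ?, ?, ?)", ROWS)
    ids = sorted(row[0] for row in conn.execute(ORIGINAL_QUERY))
    conn.close()
    return ids


print(obsolete_ids())
```

Rows 158 and 501 drop out because, for products outside the list, the subquery aggregates over no rows and yields NULL, and a NOT IN comparison against NULL is never true.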
Final Query based off of accepted answer below: I ended up using the following query with no subquery and got about 3X performance (30 sec down from 90 sec). I also now have a separate 'groups' table where I can enumerate the group members instead of spelling them out in the query itself...
SELECT DISTINCT id, EmailAddress FROM (
SELECT a.id, a.EmailAddress, a.OrderDate
FROM OrderHistoryTable a
INNER JOIN OrderHistoryTable b ON a.EmailAddress = b.EmailAddress
INNER JOIN groups g1 ON a.Product = g1.Product
INNER JOIN groups g2 ON b.Product = g2.Product
WHERE
g1.family = 'ProductGroupX'
AND g2.family = 'ProductGroupX'
GROUP BY a.id, a.OrderDate, b.OrderDate
HAVING a.OrderDate < MAX(b.OrderDate)
) dtX
Use:
SELECT a.id
FROM OrderHistoryTable AS a
LEFT JOIN (SELECT e.EmailAddress,
MAX(e.OrderDate) AS max_date
FROM OrderHistoryTable AS e
WHERE e.Product IN ('ProductA','ProductB','ProductC','ProductD')
GROUP BY e.EmailAddress) b ON b.EmailAddress = a.EmailAddress
AND b.max_date = a.OrderDate
-- rows with no match in b are not the latest order for that email address
WHERE b.EmailAddress IS NULL
AND a.Product IN ('ProductA','ProductB','ProductC','ProductD')
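As a rough check of the derived-table idea, the sketch below runs a LEFT JOIN ... IS NULL anti-join against the question's sample rows in an in-memory SQLite database (assumed here only so the example is self-contained; MySQL behaviour should be tested separately). The derived table is grouped by EmailAddress only and aliased b; the alice@example.com row and the function name not_latest_ids are hypothetical additions.

```python
import sqlite3

# Question's sample rows (dates assumed to be 2010-10-01 .. 2010-10-07),
# plus one hypothetical customer with a single order in the category.
ROWS = [
    (13, "bob@aol.com", "ProductA", "2010-10-01"),
    (15, "bob@aol.com", "ProductB", "2010-10-02"),
    (46, "bob@aol.com", "ProductD", "2010-10-03"),
    (57, "bob@aol.com", "ProductC", "2010-10-04"),
    (158, "bob@aol.com", "ProductE", "2010-10-05"),
    (206, "bob@aol.com", "ProductB", "2010-10-06"),
    (501, "bob@aol.com", "ProductZ", "2010-10-07"),
    (600, "alice@example.com", "ProductA", "2010-10-03"),  # hypothetical
]

# Anti-join: b holds the newest in-category date per email address;
# an in-category row that finds no match in b is not the latest.
ANTI_JOIN_QUERY = """
SELECT a.id
FROM OrderHistoryTable AS a
LEFT JOIN (
    SELECT e.EmailAddress, MAX(e.OrderDate) AS max_date
    FROM OrderHistoryTable AS e
    WHERE e.Product IN ('ProductA','ProductB','ProductC','ProductD')
    GROUP BY e.EmailAddress
) AS b
    ON b.EmailAddress = a.EmailAddress
   AND b.max_date = a.OrderDate
WHERE b.EmailAddress IS NULL
  AND a.Product IN ('ProductA','ProductB','ProductC','ProductD')
ORDER BY a.id
"""


def not_latest_ids():
    """Return ids of in-category orders that are not the newest per email."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OrderHistoryTable ("
        "id INTEGER PRIMARY KEY, EmailAddress TEXT NOT NULL, "
        "Product TEXT NOT NULL, OrderDate TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO OrderHistoryTable VALUES (?, ?, ?, ?)", ROWS)
    ids = [row[0] for row in conn.execute(ANTI_JOIN_QUERY)]
    conn.close()
    return ids


print(not_latest_ids())
```

The derived table is computed once rather than once per outer row, which is the main point of this rewrite; whether that actually wins on a given MySQL version is something only EXPLAIN and timing can confirm.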
Rant: OMG Ponies' answer gives what you asked for, a rewrite with a join. But I would not be too excited about it: your performance killer is the inside join on email address, which, I assume, is not particularly selective, and then your database needs to sift through all of those rows looking for the MAX of order date.
In practice, for MySQL this will mean doing a filesort (can you post EXPLAIN SELECT ....?).
Now, if MySQL had access to an index that would include emailaddress, product and orderdate, it might, especially on MyISAM, be much more efficient in determining MAX(orderdate) (and no, having an index on each of the columns is not the same as having a composite index on all of the columns). If I were trying to optimize that query, I would bet on that.
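A minimal sketch of the composite-index idea, using SQLite in memory as a stand-in (an assumption; MySQL's optimizer and its EXPLAIN output differ, so treat this only as an illustration). The index name idx_email_product_date is made up. The CREATE INDEX statement uses syntax that MySQL also accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrderHistoryTable ("
    "id INTEGER PRIMARY KEY, EmailAddress TEXT NOT NULL, "
    "Product TEXT NOT NULL, OrderDate TEXT NOT NULL)"
)

# One composite index over all three columns, as opposed to the three
# single-column indexes in the question's table definition.
conn.execute(
    "CREATE INDEX idx_email_product_date "
    "ON OrderHistoryTable (EmailAddress, Product, OrderDate)"
)

# Ask the planner how it would find the newest order date for one
# (email, product) pair; the last column of each plan row is the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT MAX(OrderDate) FROM OrderHistoryTable "
    "WHERE EmailAddress = ? AND Product = ?",
    ("bob@aol.com", "ProductB"),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
conn.close()

print(plan_text)
```

If the plan mentions the composite index, the lookup can be answered from the index alone; on MySQL the equivalent check would be to run EXPLAIN on the real query before and after adding the index.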
Other than this rant, here's my version of 'not the latest from a category'.
(I don't expect it to be better, but it is different and you should test the performance; it just might be faster due to lack of subqueries)
My attempt (not tested)
SELECT DISTINCT
notlatest.id,
notlatest.emailaddress,
notlatest.product,
notlatest.orderdate
FROM
OrderHistoryTable AS notlatest
LEFT JOIN OrderHistoryTable AS latest ON
notlatest.emailaddress = latest.emailaddress AND
notlatest.orderdate < latest.orderdate
WHERE
notlatest.product IN ('ProductA','ProductB','ProductC','ProductD') AND
latest.product IN ('ProductA','ProductB','ProductC','ProductD') AND
latest.id IS NOT NULL
Comments:
- If there is only one record in the category it will not be displayed
- Again, indexes should speed up the above considerably
Actually this is (might be) a good example of how normalizing data would improve performance - your product implies product category, but product category is not stored anywhere and the IN test will not be maintainable in the long run.
Furthermore by creating a product category you would be able to index directly on it.
If the Product was indexed on the category, then the performance of joins on the category should be better than a test on the Product indexed by value (and not category). (Actually, MyISAM's composite index on emailaddress, category, orderdate should then already contain max, min and count per category, and that should be cheap.)
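To illustrate the normalization point, here is a sketch with a small lookup table that maps each product to a family, so the IN lists disappear from the query. It runs on in-memory SQLite (an assumption made for a self-contained example); the table name product_groups, its rows and the function name are hypothetical, while the family column and the 'ProductGroupX' value follow the question's final query.

```python
import sqlite3

# Question's sample orders (dates assumed to be 2010-10-01 .. 2010-10-07).
ORDERS = [
    (13, "bob@aol.com", "ProductA", "2010-10-01"),
    (15, "bob@aol.com", "ProductB", "2010-10-02"),
    (46, "bob@aol.com", "ProductD", "2010-10-03"),
    (57, "bob@aol.com", "ProductC", "2010-10-04"),
    (158, "bob@aol.com", "ProductE", "2010-10-05"),
    (206, "bob@aol.com", "ProductB", "2010-10-06"),
    (501, "bob@aol.com", "ProductZ", "2010-10-07"),
]

# Hypothetical lookup rows: which product belongs to which family.
GROUP_ROWS = [
    ("ProductA", "ProductGroupX"),
    ("ProductB", "ProductGroupX"),
    ("ProductC", "ProductGroupX"),
    ("ProductD", "ProductGroupX"),
]

# Self-join restricted to one family on both sides; an order is obsolete
# when a newer order exists for the same email address in that family.
QUERY = """
SELECT a.id
FROM OrderHistoryTable a
INNER JOIN OrderHistoryTable b ON a.EmailAddress = b.EmailAddress
INNER JOIN product_groups g1 ON a.Product = g1.Product
INNER JOIN product_groups g2 ON b.Product = g2.Product
WHERE g1.family = ? AND g2.family = ?
GROUP BY a.id, a.OrderDate
HAVING a.OrderDate < MAX(b.OrderDate)
ORDER BY a.id
"""


def obsolete_ids_for_family(family):
    """Return ids of orders superseded by a newer order in the same family."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE OrderHistoryTable ("
        "id INTEGER PRIMARY KEY, EmailAddress TEXT NOT NULL, "
        "Product TEXT NOT NULL, OrderDate TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE product_groups ("
        "Product TEXT PRIMARY KEY, family TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO OrderHistoryTable VALUES (?, ?, ?, ?)", ORDERS)
    conn.executemany("INSERT INTO product_groups VALUES (?, ?)", GROUP_ROWS)
    ids = [row[0] for row in conn.execute(QUERY, (family, family))]
    conn.close()
    return ids


print(obsolete_ids_for_family("ProductGroupX"))
```

Adding a product to a family then becomes an INSERT into the lookup table instead of an edit to every query that spells out the list.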
My MySQL is a bit rusty (I'm used to MSSQL), but here's my best guess. It might need a bit of tweaking in the GROUP BY and HAVING clauses. Also, I assumed from your duplicate IN statements that you want the Products to match in both tables. If this isn't the case, I'll adjust the query.
SELECT a.id
FROM OrderHistoryTable a
INNER JOIN OrderHistoryTable b
ON a.Product = b.Product AND
a.EmailAddress = b.EmailAddress -- the table has no Employee column
WHERE a.Product IN ('ProductA','ProductB','ProductC','ProductD')
GROUP BY a.id, a.OrderDate
HAVING a.OrderDate < MAX(b.OrderDate) -- keep only rows older than the newest match
Edit: removed extraneous AND.
SELECT *
FROM (
SELECT product, MAX(OrderDate) AS md
FROM OrderHistoryTable
WHERE product IN ('ProductA','ProductB','ProductC','ProductD')
GROUP BY
product
) ohti
JOIN orderhistorytable oht
ON oht.product = ohti.product
AND oht.orderdate <> ohti.md
Create an index on OrderHistoryTable (product, orderdate) for this to work fast.
Also note that rows tied for MAX(orderdate) within a product, if any, all count as the latest, so the <> comparison excludes every one of them.