How to improve SQL Query Performance

Question

I have the following DB Structure (simplified):

Payments
----------------------
Id        | int
InvoiceId | int
Active    | bit
Processed | bit


Invoices
----------------------
Id              | int
CustomerOrderId | int


CustomerOrders
------------------------------------
Id                       | int
ApprovalDate             | DateTime
ExternalStoreOrderNumber | nvarchar

Each Customer Order has an Invoice and each Invoice can have multiple Payments. The ExternalStoreOrderNumber is a reference to the order in the external partner store we imported the order from, and the ApprovalDate is the timestamp of that import.

Now we have the problem that an import went wrong and we need to move some payments to other invoices (several hundred, so too many to do by hand) according to the following logic:
Find the Invoice of the Order that has the same external number as the current one, but starting with 0 instead of the current first digit.

To do that I created the following query:

UPDATE DB.dbo.Payments 
    SET InvoiceId=
        (SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
            WHERE I.CustomerOrderId=
                (SELECT TOP 1 O.Id FROM DB.dbo.CustomerOrders AS O 
                    WHERE O.ExternalOrderNumber='0'+SUBSTRING(
                      (SELECT TOP 1 OO.ExternalOrderNumber FROM DB.dbo.CustomerOrders AS OO
                        WHERE OO.Id=I.CustomerOrderId), 1, 10000)))
    WHERE Id IN (
        SELECT P.Id
          FROM DB.dbo.Payments AS P
            JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
            JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
         WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00')

I started that query on a test system using the live data (~250,000 rows in each table) and it has now been running for 16 hours. Did I do something completely wrong in the query, or is there a way to speed it up a little?
It does not need to be really fast, as it is a one-time task, but several hours seems long to me. As I want to learn for the (hopefully never happening) next time, I would like some feedback on how to improve it.

Answer 1

You might as well kill the query. Your update subquery is completely uncorrelated to the table being updated. From the looks of it, when it completes, EVERY SINGLE dbo.Payments record it touches will have the same InvoiceId.

If you break down your query, you will find that the subquery runs fine on its own:

SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
            WHERE I.CustomerOrderId=
                (SELECT TOP 1 O.Id FROM DB.dbo.CustomerOrders AS O 
                    WHERE O.ExternalOrderNumber='0'+SUBSTRING(
                      (SELECT TOP 1 OO.ExternalOrderNumber FROM DB.dbo.CustomerOrders AS OO
                        WHERE OO.Id=I.CustomerOrderId), 1, 10000))

A subquery that runs standalone like that, without referencing the table being updated, is always a BIG worry.
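To make the uncorrelated-subquery point concrete, here is a minimal sketch on throwaway table variables. The tables, the OldInvoiceId column and the "+ 100" mapping are all invented for illustration and are not part of the schema in the question.

```sql
-- Hypothetical demo data; nothing here touches the real tables.
DECLARE @Payments TABLE (Id int, OldInvoiceId int, InvoiceId int);
DECLARE @Invoices TABLE (Id int);
INSERT INTO @Payments (Id, OldInvoiceId, InvoiceId)
VALUES (1, 10, 10), (2, 20, 20), (3, 30, 30);
INSERT INTO @Invoices (Id) VALUES (110), (120), (130);

-- Uncorrelated: the subquery never mentions @Payments, so it yields one
-- value for the whole statement and every row gets the same InvoiceId.
UPDATE @Payments
   SET InvoiceId = (SELECT TOP 1 I.Id FROM @Invoices AS I ORDER BY I.Id);

SELECT Id, InvoiceId FROM @Payments;  -- all three rows show 110

-- Correlated: the subquery references the row being updated (P),
-- so each row gets its own value.
UPDATE P
   SET InvoiceId = (SELECT TOP 1 I.Id
                      FROM @Invoices AS I
                     WHERE I.Id = P.OldInvoiceId + 100  -- made-up mapping
                     ORDER BY I.Id)
  FROM @Payments AS P;

SELECT Id, InvoiceId FROM @Payments;  -- 110, 120, 130
```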

The next thing is that it is running this row-by-row for every record in the table.

You are also double-dipping into Payments: the WHERE clause picks its Ids from a join that involves Payments itself. You can reference the table being updated directly in the JOIN clause using this pattern:

UPDATE P
....
  FROM DB.dbo.Payments AS P
    JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
    JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
 WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'

Moving on, another mistake is to use TOP without ORDER BY. That's asking for random results. If you know there's only one result, you wouldn't even need TOP. In this case, maybe you're OK with randomly choosing one from many possible matches. Since you have three levels of TOP(1) without ORDER BY, you might as well just mash them all up (join) and take a single TOP(1) across all of them. That would make it look like this:

SET InvoiceId=
    (SELECT TOP 1 I.Id
     FROM DB.dbo.Invoices AS I
     JOIN DB.dbo.CustomerOrders AS O
        ON I.CustomerOrderId=O.Id
     JOIN DB.dbo.CustomerOrders AS OO
        ON O.ExternalOrderNumber='0'+SUBSTRING(OO.ExternalOrderNumber,1,100)
           AND OO.Id=I.CustomerOrderId)
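If the choice among several matches should be repeatable, an ORDER BY makes the TOP 1 pick deterministic. A small sketch on a made-up table variable (the data and the "lowest Id wins" tie-breaker are assumptions, not something the question specifies):

```sql
-- Hypothetical demo data: three invoices for the same order.
DECLARE @Invoices TABLE (Id int, CustomerOrderId int);
INSERT INTO @Invoices (Id, CustomerOrderId) VALUES (7, 1), (3, 1), (5, 1);

-- Without ORDER BY, SQL Server is free to return any of the three rows.
SELECT TOP 1 Id FROM @Invoices WHERE CustomerOrderId = 1;

-- With ORDER BY, the result is always the lowest Id.
SELECT TOP 1 Id FROM @Invoices WHERE CustomerOrderId = 1 ORDER BY Id;  -- 3
```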

However, as I mentioned very early on, this is not being correlated to the main FROM clause at all. We move the entire search into the main query so that we can make use of JOIN-based set operations rather than row-by-row subqueries.

Before I show the final query (fully commented), I think your SUBSTRING is supposed to address this logic: "but starts with 0 instead of the current digit". However, if that means what I read it to mean, then for an order number '5678' you're looking for '0678', which would also mean that SUBSTRING should be using 2,10000 instead of 1,10000.
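The difference between the two start positions is easy to check in isolation. The sample value below is just the '5678' example from the text; STUFF is shown as an alternative way to overwrite the first character and is not something the original query uses.

```sql
DECLARE @OrderNumber nvarchar(50) = N'5678';  -- sample value for illustration

SELECT '0' + SUBSTRING(@OrderNumber, 1, 10000) AS FromPosition1,  -- 05678
       '0' + SUBSTRING(@OrderNumber, 2, 10000) AS FromPosition2,  -- 0678
       STUFF(@OrderNumber, 1, 1, '0')          AS WithStuff;      -- 0678
```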

UPDATE P
SET InvoiceId=II.Id
FROM DB.dbo.Payments AS P
-- invoices for payments
JOIN DB.dbo.Invoices AS I ON I.Id=P.InvoiceId
-- orders for invoices
JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
-- another order with '0' as leading digit
JOIN DB.dbo.CustomerOrders AS OO
  ON OO.ExternalOrderNumber='0'+substring(O.ExternalOrderNumber,2,1000)
-- invoices for this other order
JOIN DB.dbo.Invoices AS II ON OO.Id=II.CustomerOrderId

-- conditions for the Payments records
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'

It is worth noting that SQL Server allows UPDATE ... FROM ... JOIN, which is less widely supported by other DBMSs, e.g. Oracle. This is because a single row in Payments (the update target) could match many II.Id values through the joins. When that happens, you will get an arbitrary one of the possible II.Id values.
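One way to find out whether that ambiguity actually occurs in the data is to run the same joins as a SELECT first and count the candidate invoices per payment. This is only a sketch that reuses the table and column names from the queries above; it has not been run against the real schema.

```sql
-- Payments for which the joins offer more than one target invoice.
SELECT P.Id AS PaymentId,
       COUNT(DISTINCT II.Id) AS CandidateInvoices
  FROM DB.dbo.Payments AS P
  JOIN DB.dbo.Invoices AS I ON I.Id = P.InvoiceId
  JOIN DB.dbo.CustomerOrders AS O ON O.Id = I.CustomerOrderId
  JOIN DB.dbo.CustomerOrders AS OO
    ON OO.ExternalOrderNumber = '0' + SUBSTRING(O.ExternalOrderNumber, 2, 1000)
  JOIN DB.dbo.Invoices AS II ON II.CustomerOrderId = OO.Id
 WHERE P.Active = 0 AND P.Processed = 0
   AND O.ApprovalDate = '2012-07-19 00:00:00'
 GROUP BY P.Id
HAVING COUNT(DISTINCT II.Id) > 1;
```

An empty result would mean every affected payment has at most one possible target invoice, so the UPDATE has nothing to choose between.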

Answer 2

I think something like this will be more efficient, if I understood your query right. As I wrote it by hand and didn't run it, it may have some syntax errors.

UPDATE DB.dbo.Payments
SET InvoiceId=(SELECT TOP 1 I.Id FROM DB.dbo.Invoices AS I
         INNER JOIN DB.dbo.CustomerOrders AS O ON I.CustomerOrderId=O.Id
         INNER JOIN DB.dbo.CustomerOrders AS OO ON OO.Id=I.CustomerOrderId
         AND O.ExternalOrderNumber='0'+SUBSTRING(OO.ExternalOrderNumber, 1, 10000))
FROM DB.dbo.Payments
            JOIN DB.dbo.Invoices AS I ON I.Id=Payments.InvoiceId
            JOIN DB.dbo.CustomerOrders AS O ON O.Id=I.CustomerOrderId
-- filters moved to WHERE so that O is joined before it is referenced
WHERE Payments.Active=0
  AND Payments.Processed=0
  AND O.ApprovalDate='2012-07-19 00:00:00'

Answer 3

Try to rewrite it using JOINs. This will highlight some of the problems. Will the following query do just the same? (The queries are somewhat different, but I guess this is roughly what you're trying to do.)

UPDATE P
   SET InvoiceId= I.Id
FROM DB.dbo.Payments AS P
CROSS JOIN DB.dbo.Invoices AS I
INNER JOIN DB.dbo.CustomerOrders AS O
  ON I.CustomerOrderId = O.Id
INNER JOIN DB.dbo.CustomerOrders AS OO
  ON O.ExternalOrderNumber = '0' + SUBSTRING(OO.ExternalOrderNumber, 1, 10000)
  AND OO.Id = I.CustomerOrderId
WHERE P.Active=0 AND P.Processed=0 AND O.ApprovalDate='2012-07-19 00:00:00'

As you see, two problems stand out:

The unconditional join between Payments and Invoices (of course, you've capped this with a TOP 1, but set-wise it's still unconditional). I'm not really sure whether this is a problem in your query. It will be in mine though :).
The join on a SUBSTRING expression over a column of up to 10000 characters, embedded in the join condition. This is highly inefficient.

If you need a one-time speedup, take the queries on each table, store the intermediate results in temporary tables, create indexes on those temporary tables, and use the temporary tables to perform the update.
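A rough sketch of that temporary-table approach follows. It reuses the table and column names from the queries above and the "start at position 2" reading of the requirement. The temp table name, the index name and the cast to nvarchar(100) are assumptions (the cast presumes order numbers are shorter than that), and none of it has been run against the real data.

```sql
-- 1. Materialise the payments to move, with the order number to look for.
SELECT P.Id AS PaymentId,
       CAST('0' + SUBSTRING(O.ExternalOrderNumber, 2, 1000)
            AS nvarchar(100)) AS TargetOrderNumber  -- assumed max length
  INTO #PaymentsToMove
  FROM DB.dbo.Payments AS P
  JOIN DB.dbo.Invoices AS I ON I.Id = P.InvoiceId
  JOIN DB.dbo.CustomerOrders AS O ON O.Id = I.CustomerOrderId
 WHERE P.Active = 0 AND P.Processed = 0
   AND O.ApprovalDate = '2012-07-19 00:00:00';

-- 2. Index the precomputed value so the next join compares plain columns.
CREATE INDEX IX_PaymentsToMove_Target
    ON #PaymentsToMove (TargetOrderNumber);

-- 3. Resolve the target invoice and update through the temp table.
UPDATE P
   SET InvoiceId = II.Id
  FROM DB.dbo.Payments AS P
  JOIN #PaymentsToMove AS M ON M.PaymentId = P.Id
  JOIN DB.dbo.CustomerOrders AS OO
    ON OO.ExternalOrderNumber = M.TargetOrderNumber
  JOIN DB.dbo.Invoices AS II ON II.CustomerOrderId = OO.Id;

DROP TABLE #PaymentsToMove;
```

With only a few hundred rows in the temp table the index may make little difference. The bigger win is probably that the string expression is computed once up front, and an index on CustomerOrders.ExternalOrderNumber, if one exists, can then be used for the lookup.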

3 answers

solution1
3 ACCEPTED 2012-10-02 11:48:01

solution2
0 2012-10-02 11:38:35

solution3
0 2012-10-02 11:43:07