SQL Server 2005 performance issue with DISTINCT

Question

I have a table tblStkMst2 which has 87 columns and 53,000 rows. If I execute the following query, it takes 83 to 96 milliseconds (Core2 Duo, 2.8 GHz, 2 GB of RAM). But when I use the DISTINCT keyword, it takes 1086 to 1103 milliseconds (more than 1 second). That is really expensive: if I apply a duplicate-removal algorithm to 53,000 rows of data myself, it does not take 1 second.

Is there any other way in SQL Server 2005 to improve execution time?

declare @monthOnly int                  set @monthOnly = 12
declare @yearOnly int                   set @yearOnly = 2011

SELECT  --(distinct)--

tblSModelMst.SMNo as [ModelID] 
,tblSModelMst.Vehicle as [ModelName]

FROM tblStkMst2 

INNER JOIN tblDCDetail ON tblStkMst2.DCNo = tblDCDetail.DCNo AND tblDCDetail.Refund=0 
INNER JOIN tblSModelMst ON tblStkMst2.SMno = tblSModelMst.SMNo 
INNER JOIN tblBuyerMst ON tblDCDetail.BNo = tblBuyerMst.BNo 
LEFT OUTER JOIN tblSModelSegment ON tblSModelMst.SMSeg = tblSModelSegment.ID
left outer JOIN dbo.tblProdManager as pd ON pd.PMID = tblBuyerMst.PMId


WHERE   (pd.Active = 1) AND ((tblStkMst2.ISSFlg = 1) or  (tblStkMst2.IsBooked = 1))
    AND (MONTH(tblStkMst2.SIssDate) = @monthOnly) AND (YEAR(tblStkMst2.SIssDate) = @yearOnly)

Answer 1

It is not that DISTINCT is very expensive (this is only 53,000 rows, which is tiny). You are seeing a significant performance difference because SQL Server is choosing a completely different query plan when you add DISTINCT. Without seeing the query plans it is very difficult to say what is happening.
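To compare the two plans, the session-level statistics switches that SQL Server 2005 provides can be turned on around both runs. This is only a sketch of one way to collect the evidence; the query itself is the one from the question and is not repeated here.

```sql
-- Report logical reads per table and CPU/elapsed time per statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Return the actual execution plan as an extra result set after each statement.
SET STATISTICS PROFILE ON;

-- Run the query from the question here, once without DISTINCT and once with it,
-- then compare the operators, row counts and reads of the two plans.

SET STATISTICS PROFILE OFF;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

The operator that differs between the two outputs (for example a sort or a hash aggregate, or a changed join order) is usually where the extra second goes.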

There are a couple of things in your query, though, that you could do better and that could significantly improve performance.

(1) Avoid where clauses like this where you need to transform a column:

AND (MONTH(tblStkMst2.SIssDate) = @monthOnly) AND (YEAR(tblStkMst2.SIssDate) = @yearOnly)

If you have an index on the SIssDate column SQL Server won't be able to use it (it will likely do a table scan as I suspect it won't be able to use another index).

If you want to take advantage of the SIssDate index, it is better to convert the @monthOnly/@yearOnly parameters into a min and max date and use these in the query:

AND (tblStkMst2.SIssDate between @minDate and @maxDate);
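A minimal sketch of how @minDate and @maxDate could be derived from the two parameters, using only functions available in SQL Server 2005. It assumes SIssDate is a datetime column (the question does not say), which is why the upper bound is set 3 ms before the next month begins.

```sql
-- Inputs from the question (SQL Server 2005 has no inline DECLARE initializers).
DECLARE @monthOnly int, @yearOnly int;
SET @monthOnly = 12;
SET @yearOnly  = 2011;

DECLARE @minDate datetime, @maxDate datetime;

-- First instant of the month: add whole months to the base date 0 (1900-01-01).
SET @minDate = DATEADD(month, (@yearOnly - 1900) * 12 + (@monthOnly - 1), 0);

-- Last datetime value of the month: 3 ms before the next month starts.
-- Assumption: SIssDate is datetime, whose resolution is roughly 3 ms.
SET @maxDate = DATEADD(ms, -3, DATEADD(month, 1, @minDate));

SELECT @minDate AS MinDate, @maxDate AS MaxDate;
```

If you would rather not depend on the datetime resolution, a half-open range such as SIssDate >= @minDate AND SIssDate < DATEADD(month, 1, @minDate) expresses the same month and can also use the index.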

If you have a surrogate primary key (which is the clustered index) on the table, and its values increase in the same order as SIssDate, it may be useful to do this before you run your query (assuming your surrogate primary key is called tblStkMst2_id):

SELECT @minId=MIN(tblStkMst2_id), @maxId=MAX(tblStkMst2_id)
FROM
tblStkMst2 WHERE tblStkMst2.SIssDate between @minDate and @maxDate;

This should be very fast, as SQL Server should not even need to look at the table (just at the SIssDate non-clustered index and the tblStkMst2_id clustered index).

Then you can do this in your main query (instead of the date check):

AND (tblStkMst2.tblStkMst2_id BETWEEN @minId and @maxId);

Using the clustered index is much faster than using a non-clustered index as the DB will be able to sequentially access these records (rather than going through the non-clustered index redirect).

(2) Delay the join to tblStkMst2 until after you do the DISTINCT (or GROUP BY). The fewer entries in the DISTINCT (GROUP BY) the better.
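One possible reading of this advice, sketched below: since the query returns only columns of tblSModelMst, the other tables can be moved into an EXISTS test, so each model is emitted at most once and no DISTINCT is needed. The aliases m, s, d, b are mine. The sketch assumes SMNo is unique in tblSModelMst, drops the join to tblSModelSegment because it neither filters rows nor supplies columns, and writes the tblProdManager join as an inner join because the original WHERE pd.Active = 1 already discards the unmatched rows. The date predicate is kept exactly as in the question.

```sql
DECLARE @monthOnly int, @yearOnly int;
SET @monthOnly = 12;
SET @yearOnly  = 2011;

SELECT m.SMNo    AS [ModelID],
       m.Vehicle AS [ModelName]
FROM tblSModelMst AS m
WHERE EXISTS (
    -- Is there at least one qualifying stock row for this model?
    SELECT 1
    FROM tblStkMst2 AS s
    INNER JOIN tblDCDetail        AS d  ON s.DCNo = d.DCNo AND d.Refund = 0
    INNER JOIN tblBuyerMst        AS b  ON d.BNo = b.BNo
    INNER JOIN dbo.tblProdManager AS pd ON pd.PMID = b.PMId
    WHERE s.SMno = m.SMNo
      AND pd.Active = 1
      AND (s.ISSFlg = 1 OR s.IsBooked = 1)
      AND MONTH(s.SIssDate) = @monthOnly
      AND YEAR(s.SIssDate)  = @yearOnly
);
```

Whether this is actually faster depends on the plan SQL Server picks, so it is worth timing it against the DISTINCT version on the real data.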

Answer 2

SQL Server optimizes to avoid worst-case execution. This can lead it to prefer a suboptimal algorithm, like preferring a sort-based distinct over a hash-based one, just to be on the safe side.

For a limited number of distinct values, hashing is the fastest way to execute a distinct operation. A hash aggregate trades memory for execution speed. But if you have a large number of values, it breaks down because the hash table is too large to hold in memory. So you need a way to tell SQL Server that the hash table will fit into memory.

One possible way to do that is to use a temporary table:

declare @t table (ModelID int, ModelName varchar(50))
insert @t (ModelID, ModelName) select ...your original query here...
select distinct ModelID, ModelName from @t

SQL Server will know the size of a temporary table (a #temp table has row counts and statistics; a table variable such as @t does not), allowing it to choose a better algorithm in many cases.
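A sketch of the same idea with a real temporary table instead of a table variable. The name #models is hypothetical, and SELECT ... INTO is used so that the column types follow the source columns rather than being guessed. The staged query is the one from the question, unchanged apart from the INTO clause.

```sql
DECLARE @monthOnly int, @yearOnly int;
SET @monthOnly = 12;
SET @yearOnly  = 2011;

-- Stage the un-deduplicated rows; #models is created by SELECT ... INTO.
SELECT tblSModelMst.SMNo    AS ModelID,
       tblSModelMst.Vehicle AS ModelName
INTO #models
FROM tblStkMst2
INNER JOIN tblDCDetail ON tblStkMst2.DCNo = tblDCDetail.DCNo AND tblDCDetail.Refund = 0
INNER JOIN tblSModelMst ON tblStkMst2.SMno = tblSModelMst.SMNo
INNER JOIN tblBuyerMst ON tblDCDetail.BNo = tblBuyerMst.BNo
LEFT OUTER JOIN tblSModelSegment ON tblSModelMst.SMSeg = tblSModelSegment.ID
LEFT OUTER JOIN dbo.tblProdManager AS pd ON pd.PMID = tblBuyerMst.PMId
WHERE (pd.Active = 1)
  AND ((tblStkMst2.ISSFlg = 1) OR (tblStkMst2.IsBooked = 1))
  AND (MONTH(tblStkMst2.SIssDate) = @monthOnly)
  AND (YEAR(tblStkMst2.SIssDate) = @yearOnly);

-- Deduplicate in a separate statement, compiled after #models is populated.
SELECT DISTINCT ModelID, ModelName
FROM #models;

DROP TABLE #models;
```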

Answer 3

Several ways.

1 - Don't use DISTINCT

2 - Create an index on tblSModelMst(SMNo) INCLUDE (Vehicle), and index your other JOIN keys.
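The index from point 2 could be created roughly as follows. The index names are hypothetical, and the two join-key indexes are only examples: some of these columns may already be covered by a primary key or an existing index, in which case the extra index is unnecessary.

```sql
-- Covering index: SMNo is the key, Vehicle is carried at the leaf level,
-- so the query can read both columns without touching the base table.
CREATE NONCLUSTERED INDEX IX_tblSModelMst_SMNo_Vehicle
    ON tblSModelMst (SMNo)
    INCLUDE (Vehicle);

-- Example indexes on join keys used by the question's query.
CREATE NONCLUSTERED INDEX IX_tblStkMst2_SMno
    ON tblStkMst2 (SMno);

CREATE NONCLUSTERED INDEX IX_tblDCDetail_DCNo
    ON tblDCDetail (DCNo);
```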

You really should figure out why you get duplicates and take care of that first. It's likely additional matching rows in one or more of your JOINed tables.
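As a starting point for that investigation, a query along these lines shows whether a join key is non-unique in one of the joined tables. It is only a sketch: it checks tblDCDetail on the assumption that it may hold several non-refunded rows per DCNo, which the question does not confirm. The same pattern can be applied to the other joined tables and their keys.

```sql
-- List DCNo values that match more than one non-refunded detail row.
-- Each such value multiplies the stock rows that join to it.
SELECT DCNo,
       COUNT(*) AS RowsPerDCNo
FROM tblDCDetail
WHERE Refund = 0
GROUP BY DCNo
HAVING COUNT(*) > 1
ORDER BY RowsPerDCNo DESC;
```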

DISTINCT has its place but is heavily overused to obscure data issues, and it's a very expensive operator, especially when you have a large number of rows you are filtering down from.

To get a more complete answer you need to explain your data structure and what you are trying to achieve.
