SQL Server 2005 performance issue with DISTINCT

I have a table tblStkMst2 with 87 columns and 53,000 rows. The following query takes 83 to 96 milliseconds (Core2 Duo, 2.8 GHz, 2 GB of RAM). When I add the DISTINCT keyword, it takes 1086 to 1103 milliseconds (more than 1 second), which is really expensive. If I ran a duplicate-removal algorithm over 53,000 rows of data myself, it would not take 1 second.

Is there any other way in SQL Server 2005 to improve the execution time?

declare @monthOnly int                  set @monthOnly = 12
declare @yearOnly int                   set @yearOnly = 2011

SELECT  --(distinct)--

tblSModelMst.SMNo as [ModelID] 
,tblSModelMst.Vehicle as [ModelName]

FROM tblStkMst2 

INNER JOIN tblDCDetail ON tblStkMst2.DCNo = tblDCDetail.DCNo AND tblDCDetail.Refund=0 
INNER JOIN tblSModelMst ON tblStkMst2.SMno = tblSModelMst.SMNo 
INNER JOIN tblBuyerMst ON tblDCDetail.BNo = tblBuyerMst.BNo 
LEFT OUTER JOIN tblSModelSegment ON tblSModelMst.SMSeg = tblSModelSegment.ID
left outer JOIN dbo.tblProdManager as pd ON pd.PMID = tblBuyerMst.PMId


WHERE   (pd.Active = 1) AND ((tblStkMst2.ISSFlg = 1) or  (tblStkMst2.IsBooked = 1))
    AND (MONTH(tblStkMst2.SIssDate) = @monthOnly) AND (YEAR(tblStkMst2.SIssDate) = @yearOnly)

It is not that DISTINCT is very expensive (this is only 53,000 rows, which is tiny). You are seeing a significant performance difference because SQL Server is choosing a completely different query plan when you add DISTINCT. Without seeing the query plans it is very difficult to tell what is happening.
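One way to capture both plans for comparison is sketched below. It only uses session settings that exist in SQL Server 2005; where exactly you paste the two query variants is up to you.

```sql
-- Report CPU/elapsed time and logical reads for every statement that follows.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- Return the actual execution plan as XML alongside each result set.
SET STATISTICS XML ON;

-- Run the query here once without DISTINCT and once with DISTINCT,
-- then compare the two plans (join order, join types, Sort vs. Hash Match).

SET STATISTICS XML OFF;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Posting both plans together with the timings makes it much easier for others to see where the extra second goes.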

There are a couple of things in your query, though, that you could do better and that could significantly improve performance.

(1) Avoid WHERE clauses like this one, where a function has to be applied to a column:

AND (MONTH(tblStkMst2.SIssDate) = @monthOnly) AND (YEAR(tblStkMst2.SIssDate) = @yearOnly)

If you have an index on the SIssDate column, SQL Server won't be able to seek on it, because MONTH() and YEAR() have to be evaluated for every row (it will likely do a table scan, as I suspect it won't be able to use another index either).

If you want to take advantage of the SIssDate index, it is better to convert the @monthOnly/@yearOnly parameters into a minimum and maximum date and use these in the query:

AND (tblStkMst2.SIssDate between @minDate and @maxDate);
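A minimal sketch of deriving the date bounds from the two parameters follows. It assumes SIssDate is a datetime column; @nextMonth is a name introduced here for illustration. SQL Server 2005 has no DATEFROMPARTS, so the sketch counts months from day 0 (1900-01-01) instead.

```sql
DECLARE @monthOnly int, @yearOnly int;
SET @monthOnly = 12;
SET @yearOnly = 2011;

DECLARE @minDate datetime, @nextMonth datetime;

-- Day 0 is 1900-01-01, so adding the number of months elapsed since then
-- yields the first day of the requested month (2011-12-01 here).
SET @minDate = DATEADD(month, (@yearOnly - 1900) * 12 + (@monthOnly - 1), 0);

-- First day of the following month (2012-01-01 here).
SET @nextMonth = DATEADD(month, 1, @minDate);

SELECT @minDate AS MinDate, @nextMonth AS NextMonth;

-- Half-open form of the predicate for the main query:
--   AND tblStkMst2.SIssDate >= @minDate AND tblStkMst2.SIssDate < @nextMonth
```

BETWEEN is inclusive on both ends, so if SIssDate carries a time portion, @maxDate would have to be the last representable instant of the month. The half-open range (>= first day, < first day of the next month) avoids that problem and is just as index-friendly.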

If you have a surrogate primary key (which is the clustered index) on the table, it may be useful to do this before you run your query (assuming your surrogate primary key is called tblStkMst2_id). Note that this only gives the same result as the date filter if the key values increase in step with SIssDate; otherwise the id range can include rows from other dates.

SELECT @minId = MIN(tblStkMst2_id), @maxId = MAX(tblStkMst2_id)
FROM tblStkMst2
WHERE tblStkMst2.SIssDate BETWEEN @minDate AND @maxDate;

This should be very fast, as SQL Server should not even need to look at the table rows themselves: the SIssDate non-clustered index already contains the clustered key tblStkMst2_id.

Then you can do this in your main query (instead of the date check):

AND (tblStkMst2.tblStkMst2_id BETWEEN @minId and @maxId);

Using the clustered index is much faster than using a non-clustered index, as the DB will be able to access these records sequentially (rather than following the non-clustered index back to the table for each row).

(2) Delay the join to tblStkMst2 until after you do the DISTINCT (or GROUP BY). The fewer entries in the DISTINCT (GROUP BY), the better.
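One possible reading of point (2), as a sketch rather than a drop-in replacement: de-duplicate a single narrow key first, and only then join to the model table to pick up the name. It assumes SMNo is unique in tblSModelMst (for example, its primary key). The outer join to tblProdManager is written as an inner join because the pd.Active = 1 filter already discards rows with no match, and the outer join to tblSModelSegment is dropped because it neither filters rows nor contributes output columns.

```sql
DECLARE @monthOnly int, @yearOnly int;
SET @monthOnly = 12;
SET @yearOnly = 2011;

SELECT m.SMNo AS [ModelID], m.Vehicle AS [ModelName]
FROM (
    -- DISTINCT now works on one small key column instead of the joined rows.
    SELECT DISTINCT s.SMno
    FROM tblStkMst2 AS s
    INNER JOIN tblDCDetail AS d ON s.DCNo = d.DCNo AND d.Refund = 0
    INNER JOIN tblBuyerMst AS b ON d.BNo = b.BNo
    INNER JOIN dbo.tblProdManager AS pd ON pd.PMID = b.PMId
    WHERE pd.Active = 1
      AND (s.ISSFlg = 1 OR s.IsBooked = 1)
      AND MONTH(s.SIssDate) = @monthOnly
      AND YEAR(s.SIssDate) = @yearOnly
) AS x
-- Assumption: at most one tblSModelMst row per SMNo, so this join adds no duplicates.
INNER JOIN tblSModelMst AS m ON m.SMNo = x.SMno;
```

The date predicate is kept as in the original query so that only the join order changes; it can be swapped for a range predicate independently.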

SQL Server optimizes to avoid worst-case execution. This can lead it to prefer a suboptimal algorithm, like preferring a disk sort over a hash sort, just to be on the safe side.

For a limited number of distinct values, a hash sort is the fastest way to execute a distinct operation. A hash sort trades memory for execution speed. But if you have a large number of values, the hash sort breaks down because the hash is too large to store in memory. So you need a way to tell SQL Server that the hash will fit into memory.

One possible way to do that is to use a temporary table (shown here as a table variable):

declare @t table (ModelID int, ModelName varchar(50))
insert @t (ModelID, ModelName) select ...your original query here...
select distinct ModelID, ModelName from @t

SQL Server will know the size of the temporary table, allowing it to choose a better algorithm in many cases.
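The same idea with a real #temp table is sketched below. This is a variation on the answer, not its exact code: table variables carry no statistics, so the optimizer usually assumes they hold very few rows, whereas a #temp table does get statistics and a row count. The column types are the ones assumed in the table-variable example and may need adjusting to the real schema.

```sql
DECLARE @monthOnly int, @yearOnly int;
SET @monthOnly = 12;
SET @yearOnly = 2011;

-- Assumed column types; adjust to match tblSModelMst.
CREATE TABLE #t (ModelID int, ModelName varchar(50));

-- Materialize the rows without DISTINCT (the fast variant of the query).
INSERT INTO #t (ModelID, ModelName)
SELECT m.SMNo, m.Vehicle
FROM tblStkMst2 AS s
INNER JOIN tblDCDetail AS d ON s.DCNo = d.DCNo AND d.Refund = 0
INNER JOIN tblSModelMst AS m ON s.SMno = m.SMNo
INNER JOIN tblBuyerMst AS b ON d.BNo = b.BNo
LEFT OUTER JOIN tblSModelSegment AS seg ON m.SMSeg = seg.ID
LEFT OUTER JOIN dbo.tblProdManager AS pd ON pd.PMID = b.PMId
WHERE pd.Active = 1
  AND (s.ISSFlg = 1 OR s.IsBooked = 1)
  AND MONTH(s.SIssDate) = @monthOnly
  AND YEAR(s.SIssDate) = @yearOnly;

-- De-duplicate in a separate statement, where the row count is known.
SELECT DISTINCT ModelID, ModelName FROM #t;

DROP TABLE #t;
```

Whether this beats the single-statement query depends on the plans involved, so it is worth timing both variants.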

Several ways.

1 - Don't use DISTINCT

2 - Create an index on TblSModelMst(SMNo) INCLUDE (Vehicle), and index your other JOIN keys.
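Point 2 could look like the following sketch. The index names are made up for illustration, and the choice of join-key indexes is an assumption based on the columns in the query's ON clauses; skip any that duplicate an existing primary key or index.

```sql
-- Covering index: the query can read SMNo and Vehicle without touching the base table.
CREATE NONCLUSTERED INDEX IX_tblSModelMst_SMNo_Vehicle
    ON tblSModelMst (SMNo)
    INCLUDE (Vehicle);

-- Indexes on the other join keys used by the query (hypothetical names).
CREATE NONCLUSTERED INDEX IX_tblStkMst2_SMno ON tblStkMst2 (SMno);
CREATE NONCLUSTERED INDEX IX_tblStkMst2_DCNo ON tblStkMst2 (DCNo);
CREATE NONCLUSTERED INDEX IX_tblDCDetail_DCNo ON tblDCDetail (DCNo);
CREATE NONCLUSTERED INDEX IX_tblDCDetail_BNo ON tblDCDetail (BNo);
```

INCLUDE columns are available from SQL Server 2005 onward, so this applies to the version in the question.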

You really should figure out why you get duplicates and take care of that first. It's likely additional matching rows in one or more of your JOINed tables.

DISTINCT has its place, but it is heavily overused to obscure data issues, and it's a very expensive operator, especially when you are filtering down from a large number of rows.

To get a more complete answer you need to explain your data structure and what you are trying to achieve.
