
SQL Server slow select from large table

I have a table with more than 20 million records.

The structure is:

EventId UNIQUEIDENTIFIER
SourceUserId UNIQUEIDENTIFIER
DestinationUserId UNIQUEIDENTIFIER
CreatedAt DATETIME
TypeId INT
MetaId INT

The table receives more than 100k new records each day.

I have an index on each column except MetaId, as it is not used in WHERE clauses.

The problem appears when I want to pick up, e.g., the latest 100 records for a given SourceUserId.

The query sometimes takes up to 4 minutes to execute, which is not acceptable.

E.g.:

SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND 
(
 TypeId IN (2, 3, 4)
    OR 
 (TypeId = 60 AND SrcMemberId != DstMemberId)
)
ORDER BY CreatedAt DESC

I can't use partitioning etc., as I am on the Standard edition of SQL Server and Enterprise is too expensive.

I also think the table is too small to be that slow.

I think the problem is the ORDER BY clause, as the database must go through a much bigger set of data.

Any ideas on how to make it quicker?

Perhaps a relational database is not a good fit for this kind of data.

Data is always picked up ordered by CreatedAt DESC.

Thank you for reading.

PabloX

You'll likely want to create a composite index for this type of query. When the query runs slowly, it is most likely choosing to scan down an index on the CreatedAt column and apply a residual filter on the SourceUserId value, when what you really want is to jump directly to all records for a given SourceUserId, already in the right order. To achieve this, create a composite index keyed primarily on SourceUserId (for the equality check) and secondarily on CreatedAt (to preserve the order within a given SourceUserId value). You may want to try adding TypeId as well, depending on the selectivity of this column.

So, the two indexes most likely to give the best repeatable performance (try them out and compare) would be:

  1. Index on (SourceUserId, CreatedAt)
  2. Index on (SourceUserId, TypeId, CreatedAt)
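
As a rough sketch, the two candidates above could be created like this. The index names and the dbo schema are assumptions made for illustration; only the key column lists come from the answer.

```sql
-- Candidate 1: seek on SourceUserId, with rows already ordered by CreatedAt.
-- An ascending key can be scanned backward to serve ORDER BY CreatedAt DESC.
-- (Index name and dbo schema are hypothetical.)
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_CreatedAt
    ON dbo.Events (SourceUserId, CreatedAt);

-- Candidate 2: adds TypeId to the key; worth comparing if TypeId is selective.
-- (Index name and dbo schema are hypothetical.)
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_TypeId_CreatedAt
    ON dbo.Events (SourceUserId, TypeId, CreatedAt);
```

With the second index, rows for several TypeId values are not globally ordered by CreatedAt, so the plan may still need a sort. Comparing the actual execution plans of both is the safest way to decide.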

As always, there are many other considerations to take into account when determining how, what, and where to index. As Remus discusses in a separate answer, one big consideration is covering the query vs. keeping lookups. Additionally, you'll need to consider write volumes, possible fragmentation impact (if any), singleton lookups vs. large sequential scans, etc.

I have indexes on each column except MetaId

Non-covering indexes will likely hit the 'tipping point', and the query will revert to a table scan. Just adding an index on every column because it is used in a WHERE clause does not equate to good index design. To take your query as an example, a good 100% covering index would be:

INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId, SrcMemberId, DstMemberId)

The following index is also useful, although it is still going to cause lookups:

INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId)

And finally, an index without any included columns may help, but is just as likely to be ignored (this depends on the column statistics and cardinality estimates):

INDEX ON (SourceUserId , CreatedAt)

But a separate index on SourceUserId and one on CreatedAt are basically useless for your query.

See Index Design Basics.
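
For reference, the covering index written in shorthand above might look like this as a full statement. The index name, the dbo schema, and the placeholder GUID are assumptions; SrcMemberId and DstMemberId are the column names used in the question's query.

```sql
-- Key columns support the seek and the ordering; INCLUDE columns avoid lookups
-- for the filter. (Index name and dbo schema are hypothetical.)
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_CreatedAt_Covering
    ON dbo.Events (SourceUserId, CreatedAt)
    INCLUDE (TypeId, SrcMemberId, DstMemberId);

-- Placeholder value for illustration only.
DECLARE @SourceUserId UNIQUEIDENTIFIER = '00000000-0000-0000-0000-000000000000';

-- Listing only columns present in the index lets it cover the query;
-- SELECT * would still need a lookup for every column the index lacks.
SELECT TOP (100) SourceUserId, CreatedAt, TypeId, SrcMemberId, DstMemberId
FROM dbo.Events
WHERE SourceUserId = @SourceUserId
  AND (TypeId IN (2, 3, 4)
       OR (TypeId = 60 AND SrcMemberId <> DstMemberId))
ORDER BY CreatedAt DESC;
```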

The fact that the table has indexes built on GUID values indicates a possible series of problems that would affect performance:

  • High index fragmentation: since new GUIDs are generated randomly, the index cannot organize them in a sequential order and the nodes are spread unevenly.
  • High number of page splits: the size of a GUID (16 bytes) causes many page splits in the index, since there's a greater chance that a new value won't fit in the remaining space available in a page.
  • Slow value comparison: comparing two GUIDs is slower than comparing integers, since a uniqueidentifier is 16 bytes wide versus 4 bytes for an INT.

Here are a couple of resources on how to investigate and resolve these problems:

I would recommend getting the data into two separate table variables:

INSERT INTO @Table1
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND 
(
 TypeId IN (2, 3, 4)
)
INSERT INTO @Table2
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND 
(
 (TypeId = 60 AND SrcMemberId != DstMemberId)
)

Then apply a UNION over the two selects, with the ORDER BY and TOP applied to the combined result. Limit the data from the get-go.

I suggest using a UNION:
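
A minimal sketch of that final step, assuming @Table1 and @Table2 were declared and filled earlier in the same batch with the same columns as Events:

```sql
-- Assumes @Table1 and @Table2 are declared and populated earlier in this batch.
-- UNION ALL is enough here: the two filters (TypeId IN (2, 3, 4) vs. TypeId = 60)
-- cannot match the same row, so no de-duplication is needed.
SELECT TOP (100) u.*
FROM (
    SELECT * FROM @Table1
    UNION ALL
    SELECT * FROM @Table2
) AS u
ORDER BY u.CreatedAt DESC;
```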

SELECT TOP 100 x.*
  FROM (SELECT a.*
          FROM EVENTS a
         WHERE a.typeid IN (2, 3, 4)
        UNION ALL
        SELECT b.*
          FROM EVENTS b
         WHERE b.typeid = 60 
           AND b.srcmemberid != b.dstmemberid) x
 WHERE x.sourceuserid = '15b534b17-5a5a-415a-9fc0-7565199c3461'
 -- ORDER BY is needed so that TOP 100 returns the latest rows
 ORDER BY x.createdat DESC

We've realised a minor gain by moving to a BIGINT IDENTITY key for our event table; by using that as a clustered primary key, we can cheat and use that for date ordering.
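
A sketch of what that could look like. The table name, the EventSeq column, and the placeholder GUID are hypothetical, and the trick assumes rows are inserted in chronological order.

```sql
-- Hypothetical table and column names; a sketch of the approach, not the poster's schema.
CREATE TABLE dbo.EventsById
(
    EventSeq     BIGINT IDENTITY(1, 1) NOT NULL,
    SourceUserId UNIQUEIDENTIFIER NOT NULL,
    CreatedAt    DATETIME NOT NULL,
    TypeId       INT NOT NULL,
    CONSTRAINT PK_EventsById PRIMARY KEY CLUSTERED (EventSeq)
);

-- Placeholder value for illustration only.
DECLARE @SourceUserId UNIQUEIDENTIFIER = '00000000-0000-0000-0000-000000000000';

-- Ordering by the identity stands in for ordering by CreatedAt,
-- which only holds if rows are inserted in chronological order.
SELECT TOP (100) *
FROM dbo.EventsById
WHERE SourceUserId = @SourceUserId
ORDER BY EventSeq DESC;
```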

I would make sure CreatedAt is indexed properly.

You could split the query in two with a UNION to avoid the OR (which can prevent your index from being used), something like:

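SELECT TOP 100 * FROM (
    -- Each branch takes its own latest 100 rows; TOP needs ORDER BY for that
    SELECT * FROM (
        SELECT TOP 100 * FROM Events WITH (NOLOCK)
        WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
          AND TypeId IN (2, 3, 4)
        ORDER BY CreatedAt DESC
    ) AS a
    UNION
    SELECT * FROM (
        SELECT TOP 100 * FROM Events WITH (NOLOCK)
        WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
          AND TypeId = 60 AND SrcMemberId != DstMemberId
        ORDER BY CreatedAt DESC
    ) AS b
) AS u  -- SQL Server requires an alias on a derived table
ORDER BY CreatedAt DESC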

Also, check that the uniqueidentifier indexes are not CLUSTERED.

If there are 100K records added each day, you should check your index fragmentation and rebuild or reorganize the indexes accordingly. More info: SQLauthority
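
One way to check this is to list the indexes on the table together with their key columns from the catalog views. The dbo schema is an assumption here.

```sql
-- Lists every index on the table with its type (CLUSTERED / NONCLUSTERED)
-- and its columns. The dbo schema is an assumption.
SELECT i.name            AS index_name,
       i.type_desc       AS index_type,
       c.name            AS column_name,
       ic.key_ordinal,
       ic.is_included_column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'dbo.Events')
ORDER BY i.index_id, ic.key_ordinal;
```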
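
A minimal sketch of such a check, assuming the table lives in the dbo schema of the current database. Which fragmentation level justifies a reorganize versus a rebuild is a judgment call; a commonly cited rule of thumb is to reorganize at moderate fragmentation and rebuild at high fragmentation.

```sql
-- Fragmentation per index on the table (LIMITED mode is the cheapest scan).
-- The dbo schema is an assumption.
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Events'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;

-- Lighter-weight option: defragment the leaf level in place.
ALTER INDEX ALL ON dbo.Events REORGANIZE;

-- Heavier option: rebuild the indexes from scratch.
-- ALTER INDEX ALL ON dbo.Events REBUILD;
```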
