
Sort order of a SQL Server 2008+ clustered index

Does the sort order of a SQL Server 2008+ clustered index impact insert performance?

The datatype in this specific case is integer and the inserted values are ascending (Identity). Therefore, the sort order of the index would be opposite to the sort order of the values being inserted.

My guess is that it will have an impact, but I don't know; maybe SQL Server has some optimizations for this case, or its internal data storage format is indifferent to it.

Please note that the question is about INSERT performance, not SELECT.

Update
To be more clear about the question: what happens when the values to be inserted (integer) arrive in the reverse order (ASC) of the clustered index's ordering (DESC)?

There is a difference. Inserting out of cluster order causes massive fragmentation.

When you run the following code, the DESC clustered index generates additional UPDATE operations at the non-leaf level.

CREATE TABLE dbo.TEST_ASC(ID INT IDENTITY(1,1) 
                            ,RandNo FLOAT
                            );
GO
CREATE CLUSTERED INDEX cidx ON dbo.TEST_ASC(ID ASC);
GO

CREATE TABLE dbo.TEST_DESC(ID INT IDENTITY(1,1) 
                            ,RandNo FLOAT
                            );
GO
CREATE CLUSTERED INDEX cidx ON dbo.TEST_DESC(ID DESC);
GO

INSERT INTO dbo.TEST_ASC VALUES(RAND());
GO 100000

INSERT INTO dbo.TEST_DESC VALUES(RAND());
GO 100000

The two INSERT statements produce exactly the same execution plan, but when looking at the operational stats, the difference shows up in [nonleaf_update_count].

SELECT 
OBJECT_NAME(object_id)
,* 
FROM sys.dm_db_index_operational_stats(DB_ID(),OBJECT_ID('TEST_ASC'),null,null)
UNION
SELECT 
OBJECT_NAME(object_id)
,* 
FROM sys.dm_db_index_operational_stats(DB_ID(),OBJECT_ID('TEST_DESC'),null,null)

There is an extra, under-the-hood operation going on when SQL Server works with a DESC index that runs against the IDENTITY. This is because the DESC table becomes fragmented (rows are inserted at the start of each page) and additional updates occur to maintain the B-tree structure.

The most noticeable thing about this example is that the DESC clustered index becomes over 99% fragmented. This recreates the same bad behaviour as using a random GUID for a clustered index key. The code below demonstrates the fragmentation.
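As a side note, fragmentation accumulated this way can be repaired after the fact, though it will build up again with every new batch of ascending IDENTITY inserts. A minimal sketch, assuming the demo tables above:

```sql
-- Rebuild the fragmented DESC index; note the fragmentation returns as
-- new ascending IDENTITY values keep landing at the "wrong" end of the index.
ALTER INDEX cidx ON dbo.TEST_DESC REBUILD;

-- Verify the result
SELECT avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.TEST_DESC'), NULL, NULL, NULL);
```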

SELECT 
OBJECT_NAME(object_id)
,* 
FROM sys.dm_db_index_physical_stats  (DB_ID(), OBJECT_ID('dbo.TEST_ASC'), NULL, NULL ,NULL) 
UNION
SELECT 
OBJECT_NAME(object_id)
,* 
FROM sys.dm_db_index_physical_stats  (DB_ID(), OBJECT_ID('dbo.TEST_DESC'), NULL, NULL ,NULL) 

UPDATE:

On some test environments I'm also seeing that the DESC table is subject to more waits, with an increase in [page_io_latch_wait_count] and [page_io_latch_wait_in_ms].

UPDATE:

Some discussion has arisen about the point of a descending index when SQL Server can perform backward scans. Please read this article about the limitations of backward scans.
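For context, a backward scan is what lets an ASC index serve a DESC ordering without a sort operator; one known limitation is that backward range scans cannot use parallelism. A hypothetical illustration, using the demo table from above:

```sql
-- Typically served by a backward range scan of the ASC index:
-- no Sort operator in the plan, but the scan runs serially even
-- when the rest of the plan could go parallel.
SELECT TOP (100) ID, RandNo
FROM dbo.TEST_ASC
ORDER BY ID DESC;
```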

The order of values inserted into a clustered index most certainly impacts the performance of the index, by potentially creating a lot of fragmentation, and also affects the performance of the insert itself.

I've constructed a test bed to see what happens:

USE tempdb;

CREATE TABLE dbo.TestSort
(
    Sorted INT NOT NULL
        CONSTRAINT PK_TestSort
        PRIMARY KEY CLUSTERED
    , SomeData VARCHAR(2048) NOT NULL
);

INSERT INTO dbo.TestSort (Sorted, SomeData)
VALUES  (1797604285, CRYPT_GEN_RANDOM(1024))
    , (1530768597, CRYPT_GEN_RANDOM(1024))
    , (1274169954, CRYPT_GEN_RANDOM(1024))
    , (-1972758125, CRYPT_GEN_RANDOM(1024))
    , (1768931454, CRYPT_GEN_RANDOM(1024))
    , (-1180422587, CRYPT_GEN_RANDOM(1024))
    , (-1373873804, CRYPT_GEN_RANDOM(1024))
    , (293442810, CRYPT_GEN_RANDOM(1024))
    , (-2126229859, CRYPT_GEN_RANDOM(1024))
    , (715871545, CRYPT_GEN_RANDOM(1024))
    , (-1163940131, CRYPT_GEN_RANDOM(1024))
    , (566332020, CRYPT_GEN_RANDOM(1024))
    , (1880249597, CRYPT_GEN_RANDOM(1024))
    , (-1213257849, CRYPT_GEN_RANDOM(1024))
    , (-155893134, CRYPT_GEN_RANDOM(1024))
    , (976883931, CRYPT_GEN_RANDOM(1024))
    , (-1424958821, CRYPT_GEN_RANDOM(1024))
    , (-279093766, CRYPT_GEN_RANDOM(1024))
    , (-903956376, CRYPT_GEN_RANDOM(1024))
    , (181119720, CRYPT_GEN_RANDOM(1024))
    , (-422397654, CRYPT_GEN_RANDOM(1024))
    , (-560438983, CRYPT_GEN_RANDOM(1024))
    , (968519165, CRYPT_GEN_RANDOM(1024))
    , (1820871210, CRYPT_GEN_RANDOM(1024))
    , (-1348787729, CRYPT_GEN_RANDOM(1024))
    , (-1869809700, CRYPT_GEN_RANDOM(1024))
    , (423340320, CRYPT_GEN_RANDOM(1024))
    , (125852107, CRYPT_GEN_RANDOM(1024))
    , (-1690550622, CRYPT_GEN_RANDOM(1024))
    , (570776311, CRYPT_GEN_RANDOM(1024))
    , (2120766755, CRYPT_GEN_RANDOM(1024))
    , (1123596784, CRYPT_GEN_RANDOM(1024))
    , (496886282, CRYPT_GEN_RANDOM(1024))
    , (-571192016, CRYPT_GEN_RANDOM(1024))
    , (1036877128, CRYPT_GEN_RANDOM(1024))
    , (1518056151, CRYPT_GEN_RANDOM(1024))
    , (1617326587, CRYPT_GEN_RANDOM(1024))
    , (410892484, CRYPT_GEN_RANDOM(1024))
    , (1826927956, CRYPT_GEN_RANDOM(1024))
    , (-1898916773, CRYPT_GEN_RANDOM(1024))
    , (245592851, CRYPT_GEN_RANDOM(1024))
    , (1826773413, CRYPT_GEN_RANDOM(1024))
    , (1451000899, CRYPT_GEN_RANDOM(1024))
    , (1234288293, CRYPT_GEN_RANDOM(1024))
    , (1433618321, CRYPT_GEN_RANDOM(1024))
    , (-1584291587, CRYPT_GEN_RANDOM(1024))
    , (-554159323, CRYPT_GEN_RANDOM(1024))
    , (-1478814392, CRYPT_GEN_RANDOM(1024))
    , (1326124163, CRYPT_GEN_RANDOM(1024))
    , (701812459, CRYPT_GEN_RANDOM(1024));

The first column is the primary key, and as you can see the values are listed in random(ish) order. Listing the values in random order should force SQL Server to either:

  1. Sort the data pre-insert, or
  2. Not sort the data, resulting in a fragmented table.

The CRYPT_GEN_RANDOM() function is used to generate 1024 bytes of random data per row, making the table consume multiple pages, which in turn lets us see the effects of fragmented inserts.

Once you run the above insert, you can check fragmentation like this:

SELECT * 
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('TestSort'), 1, 0, 'SAMPLED') ips;

Running this on my SQL Server 2012 Developer Edition instance shows average fragmentation of 90%, indicating SQL Server did not sort during the insert.

The moral of this particular story is likely to be, "when in doubt, sort, if it will be beneficial". Having said that, adding an ORDER BY clause to an INSERT statement does not guarantee the inserts will occur in that order. Consider what happens if the insert goes parallel, as an example.
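If you want the best chance of ordered inserts without trace flags, one common approach is to combine ORDER BY with a serial plan. This is only a sketch, not a guarantee of physical insert order, and #Staging is a hypothetical source table:

```sql
-- MAXDOP 1 removes the parallel-insert caveat, though ORDER BY on an
-- INSERT still only guarantees the order of IDENTITY value assignment,
-- not the physical order in which rows hit the pages.
INSERT INTO dbo.TestSort (Sorted, SomeData)
SELECT s.Sorted, s.SomeData
FROM #Staging AS s            -- hypothetical staging table
ORDER BY s.Sorted
OPTION (MAXDOP 1);
```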

On non-production systems you can use trace flag 2332 as an option on the INSERT statement to "force" SQL Server to sort the input prior to inserting it. @PaulWhite has an interesting article, Optimizing T-SQL queries that change data, covering that and other details. Be aware that this trace flag is unsupported and should NOT be used in production systems, since that might void your warranty. In a non-production system, for your own education, you can try adding this to the end of the INSERT statement:

OPTION (QUERYTRACEON 2332);

Once you have that appended to the insert, take a look at the plan and you'll see an explicit sort:

[Execution plan screenshot showing the explicit Sort operator]

It would be great if Microsoft would make this a supported trace flag.

Paul White made me aware that SQL Server does automatically introduce a sort operator into the plan when it thinks one will be helpful. For the sample query above, if I run the insert with 250 items in the VALUES clause, no sort is implemented automatically. However, at 251 items, SQL Server automatically sorts the values prior to the insert. Why the cutoff is 250/251 rows remains a mystery to me, other than that it seems to be hard-coded. If I reduce the size of the data inserted into the SomeData column to just one byte, the cutoff is still 250/251 rows, even though in both cases the table fits in a single page. Interestingly, looking at the insert with SET STATISTICS IO, TIME ON; shows that inserts with a single-byte SomeData value take twice as long when sorted.

Without the sort (i.e. 250 rows inserted):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 16 ms, elapsed time = 16 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'TestSort'. Scan count 0, logical reads 501, physical reads 0, 
   read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob 
   read-ahead reads 0.

(250 row(s) affected)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 11 ms.

With the sort (i.e. 251 rows inserted):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 15 ms, elapsed time = 17 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
Table 'TestSort'. Scan count 0, logical reads 503, physical reads 0, 
   read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob 
   read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, 
   read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob 
   read-ahead reads 0.

(251 row(s) affected)

(1 row(s) affected)

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 21 ms.

Once you start to increase the row size, the sorted version certainly becomes more efficient. When inserting 4096 bytes into SomeData, the sorted insert is nearly twice as fast on my test rig as the unsorted insert.


As a side note, in case you're interested, I generated the VALUES (...) clause using this T-SQL:

;WITH s AS (
    SELECT v.Item
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) v(Item)
)
, v AS (
    SELECT Num = CONVERT(int, CRYPT_GEN_RANDOM(10), 0)
)
, o AS (
    SELECT v.Num
        , rn = ROW_NUMBER() OVER (PARTITION BY v.Num ORDER BY NEWID())
    FROM s s1
        CROSS JOIN s s2
        CROSS JOIN s s3
        CROSS JOIN v 
)
SELECT TOP(50) ', (' 
        + REPLACE(CONVERT(varchar(11), o.Num), '*', '0') 
        + ', CRYPT_GEN_RANDOM(1024))'
FROM o
WHERE rn = 1
ORDER BY NEWID();

This generates 1,000 random values and selects only the top 50 rows with unique values in the first column. I copied-and-pasted the output into the INSERT statement above.

As long as the data comes ordered by the clustered index (irrespective of whether it's ascending or descending), there should not be any impact on insert performance. The reasoning behind this is that SQL Server does not care about the physical order of the rows within a page for the clustered index. The order of the rows is kept in what is called the "record offset array", which is the only thing that needs to be rewritten for a new row (and would have been rewritten anyway, irrespective of order). The actual data rows just get written one after the other.

At the transaction log level, the entries should be identical irrespective of the direction, so this will not generate any additional impact on performance. Usually the transaction log is what generates most performance issues, but in this case there will be none.
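This claim can be checked rather than assumed. On a non-production system, the undocumented fn_dblog function exposes the log records generated by each insert pattern; a rough sketch, assuming the TestSort table from earlier and default allocation-unit naming:

```sql
-- Undocumented, non-production only: count log records by operation
-- after running one of the insert batches in a fresh database.
SELECT [Operation], COUNT(*) AS record_count
FROM fn_dblog(NULL, NULL)
WHERE AllocUnitName LIKE '%TestSort%'   -- assumes default allocation unit naming
GROUP BY [Operation]
ORDER BY record_count DESC;
```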

You can find a good explanation of the physical structure of a page/row here: https://www.simple-talk.com/sql/database-administration/sql-server-storage-internals-101/ .

So basically, as long as your inserts do not generate page splits (and if the data comes in the order of the clustered index, irrespective of direction, they will not), your inserts will have negligible impact, if any, on insert performance.
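Whether page splits actually occurred can also be measured. The leaf_allocation_count column in the operational-stats DMV counts leaf-page allocations, which includes allocations caused by splits; a sketch, using the TestSort table from earlier:

```sql
-- Compare these counters before and after an insert batch:
-- a jump in leaf_allocation_count beyond the expected page count
-- suggests page-split activity.
SELECT OBJECT_NAME(ios.object_id) AS table_name,
       ios.leaf_allocation_count,
       ios.nonleaf_allocation_count
FROM sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('dbo.TestSort'), NULL, NULL) AS ios;
```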

Based on the code below, inserting data into an identity column with a sorted clustered index is more resource-intensive when the selected data is ordered in the opposite direction of the clustered index.

In this example, logical reads nearly double.

After 10 runs, the ascending-sorted inserts average 2,284 logical reads and the descending-sorted inserts average 4,301.

--DROP TABLE Destination;
CREATE TABLE Destination (MyId INT IDENTITY(1,1));

CREATE CLUSTERED INDEX ClIndex ON Destination(MyId ASC);

SET IDENTITY_INSERT Destination ON;

-- Insert in the same order as the clustered index
INSERT INTO Destination (MyId)
SELECT TOP (1000) n = ROW_NUMBER() OVER (ORDER BY [object_id])
FROM sys.all_objects
ORDER BY n;

-- Insert in the opposite order of the clustered index
INSERT INTO Destination (MyId)
SELECT TOP (1000) n = ROW_NUMBER() OVER (ORDER BY [object_id])
FROM sys.all_objects
ORDER BY n DESC;

More about logical reads, if you are interested: https://www.brentozar.com/archive/2012/06/tsql-measure-performance-improvements/
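To reproduce the logical-read numbers above, wrap each insert batch in a statistics capture; a minimal sketch:

```sql
SET STATISTICS IO ON;

-- Run one of the two INSERT batches here, then read the
-- "logical reads" figure from the Messages tab.

SET STATISTICS IO OFF;
```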
