Insert sorted data into a table with a nonclustered index

My db schema:

  • Table point ( point_id int PK, name varchar );
  • Table point_log ( point_log_id int PK, point_id int FK, timestamp datetime, value int );

point_log has an index:

point_log_idx1 (point_id asc, timestamp asc)

I need to insert point log samples into the point_log table. Each transaction inserts log samples for only one point_id, and the samples are already sorted in ascending order. That means all the log sample data in a transaction is already in the order of the index (point_log_idx1). How can I make SQL Server take advantage of this to avoid the tree search cost?

This looks like a good opportunity to change the clustered index on Point_log so that it clusters by its parent point_id foreign key:

CREATE TABLE Point_log
( 
    point_log_id int PRIMARY KEY NONCLUSTERED, 
    point_id int, 
    timestamp datetime, 
    value int
);

And then:

CREATE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id);

Rationale: This will reduce the read IO on point_log when fetching point_log records for a given point_id.

Moreover, given that SQL Server will add a 4-byte uniquifier to a non-unique clustered index, you may as well include the surrogate PK in the clustered index to make it unique, viz:

CREATE UNIQUE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id, point_log_id);

The nonclustered index point_log_idx1 (point_id asc, timestamp asc) would need to be retained if you have a large number of point_logs per point, assuming good selectivity of queries filtering on point_log.point_id and point_log.timestamp.

The tree search cost is probably negligible compared to the cost of physically writing to disk, page splitting, and logging.

1) You should definitely insert data in bulk, rather than row by row.

2) To reduce page splitting of the point_log_idx1 index, you can try using ORDER BY in the INSERT statement. It still doesn't guarantee the physical order on disk, but it does guarantee the order in which the point_log_id IDENTITY values are generated, and hopefully it will hint the engine to process the source data in this order. If the source data is processed in the requested order, the b-tree structure of the point_log_idx1 index may grow without unnecessary, costly page splits.

I'm using SQL Server 2008. I have a system that collects a lot of monitoring data in a central database 24/7. Originally I was inserting data as it arrived, row by row. Then I realized that each insert was a separate transaction and that the system spent most of its time writing to the transaction log.

Eventually I moved to inserting data in batches using a stored procedure that accepts a table-valued parameter. In my case a batch is a few hundred to a few thousand rows. In my system I keep data only for a given number of days, so I regularly delete obsolete data. To keep the system performance stable I rebuild my indexes regularly as well.
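
The retention cleanup and index rebuild mentioned above might be sketched roughly as follows. This is only an illustration under assumptions: the 30-day window, the 5000-row chunk size, and the choice to delete in chunks are hypothetical and not taken from the system described here.

```sql
-- Hypothetical retention window; adjust to your own requirements.
DECLARE @KeepDays int = 30;
DECLARE @Cutoff datetime = DATEADD(DAY, -@KeepDays, GETDATE());
DECLARE @Deleted int = 1;

-- Delete in small chunks (assumed size) so each implicit transaction,
-- and the log space it needs, stays short.
WHILE @Deleted > 0
BEGIN
    DELETE TOP (5000)
    FROM dbo.point_log
    WHERE [timestamp] < @Cutoff;

    SET @Deleted = @@ROWCOUNT;
END;

-- Rebuild the nonclustered index afterwards to remove fragmentation.
ALTER INDEX point_log_idx1 ON dbo.point_log REBUILD;
```

How often to run something like this, and whether a full REBUILD is worth its cost compared to leaving the index alone, is something to measure on your own workload.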

In your example, it may look like the following.

First, create a table type:

CREATE TYPE [dbo].[PointValuesTableType] AS TABLE(
    point_id int,
    timestamp datetime,
    value int
)

Then the procedure would look like this:

CREATE PROCEDURE [dbo].[InsertPointValues]
    -- Add the parameters for the stored procedure here
    @ParamRows dbo.PointValuesTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY

        INSERT INTO dbo.point_log
            (point_id
            ,timestamp
            ,value)
        SELECT
            TT.point_id
            ,TT.timestamp
            ,TT.value
        FROM @ParamRows AS TT
        ORDER BY TT.point_id, TT.timestamp;

        COMMIT TRANSACTION;
    END TRY
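    BEGIN CATCH
        -- Roll back, then re-raise so the caller sees the failure
        -- (RAISERROR is used because THROW requires SQL Server 2012+).
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        DECLARE @ErrMsg nvarchar(2048) = ERROR_MESSAGE();
        RAISERROR('%s', 16, 1, @ErrMsg);
    END CATCH;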

END
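
As a usage sketch, a caller could fill a variable of the table type and pass it to the procedure. The sample values below are made up for illustration; in a real system the rows would normally come from the client application as a table-valued parameter.

```sql
-- Hypothetical sample batch: three readings for point_id = 1,
-- already in ascending timestamp order.
DECLARE @Rows dbo.PointValuesTableType;

INSERT INTO @Rows (point_id, [timestamp], value)
VALUES
    (1, '2024-01-01T00:00:00', 10),
    (1, '2024-01-01T00:01:00', 12),
    (1, '2024-01-01T00:02:00', 11);

-- The whole batch is inserted in one transaction inside the procedure.
EXEC dbo.InsertPointValues @ParamRows = @Rows;
```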

In practice you should measure which is more efficient for your system: with ORDER BY, or without. You really need to consider the performance of the INSERT operation as well as the performance of subsequent queries.

It may be that faster inserts lead to higher fragmentation of the index, which leads to slower queries.

So, you should check the fragmentation of the index after an INSERT with ORDER BY and after one without. You can use sys.dm_db_index_physical_stats to get index stats.

It returns size and fragmentation information for the data and indexes of the specified table or view in SQL Server.
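
A minimal sketch of such a check, assuming the table lives in the dbo schema of the current database and the index is named point_log_idx1 as in the question:

```sql
-- Resolve the ids up front; a NULL @ObjectId would mean the name did not resolve.
DECLARE @DbId smallint = DB_ID();
DECLARE @ObjectId int = OBJECT_ID(N'dbo.point_log');

SELECT
    i.name AS index_name,
    ps.index_type_desc,
    ps.avg_fragmentation_in_percent,
    ps.fragment_count,
    ps.page_count
FROM sys.dm_db_index_physical_stats(@DbId, @ObjectId, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
WHERE i.name = N'point_log_idx1';
```

Running it after a test load with ORDER BY and again after one without gives comparable avg_fragmentation_in_percent figures for the two approaches.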
