Insert sorted data into a table with a nonclustered index
My db schema:

point
( point_id int PK, name varchar
);

point_log
( point_log_id int PK, point_id int FK, timestamp datetime, value int
)

point_log has an index:

point_log_idx1 (point_id asc, timestamp asc)

I need to insert point log samples into the point_log table. Each transaction inserts log samples for only one point_id, and the log samples are already sorted in ascending order. That means all the log sample data in a transaction is already in the order of the index (point_log_idx1). How can I make SQL Server take advantage of this to avoid the tree search cost?
This looks like a good opportunity for changing the clustered index on Point_Log to cluster by its parent point_id foreign key:
CREATE TABLE Point_log
(
point_log_id int PRIMARY KEY NONCLUSTERED,
point_id int,
timestamp datetime,
value int
);
And then:
CREATE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id);
Rationale: This will reduce the read I/O on point_log when fetching point_log records for a given point_id.
Moreover, given that SQL Server will add a 4-byte uniquifier to a non-unique clustered index, you may as well include the surrogate PK in the clustered index key to make it unique, viz:
CREATE UNIQUE CLUSTERED INDEX C_PointLog ON dbo.Point_log(point_id, point_log_id);
The nonclustered index point_log_idx1 (point_id asc, timestamp asc) would need to be retained if you have a large number of point_log rows per point, assuming good selectivity of queries filtering on point_log.point_id and point_log.timestamp.
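As an illustration of the kind of query that would justify keeping point_log_idx1, here is a minimal sketch of a time-range lookup for a single point. The variable names and the sample values are hypothetical, not taken from the question.

```sql
-- Hypothetical parameters: one point and a one-day time window.
DECLARE @point_id  int;
DECLARE @from_time datetime;
DECLARE @to_time   datetime;
SET @point_id  = 1;
SET @from_time = '2015-01-01T00:00:00';
SET @to_time   = '2015-01-02T00:00:00';

-- Equality on point_id plus a range on timestamp matches the
-- key order of point_log_idx1 (point_id asc, timestamp asc).
SELECT pl.[timestamp], pl.value
FROM dbo.point_log AS pl
WHERE pl.point_id = @point_id
  AND pl.[timestamp] >= @from_time
  AND pl.[timestamp] <  @to_time
ORDER BY pl.[timestamp];
```

Whether the optimizer actually picks the nonclustered index depends on how many rows the range covers, so it is worth checking the execution plan on real data.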
The tree search cost is probably negligible compared to the cost of physically writing to disk, page splitting, and logging.

1) You should definitely insert data in bulk, rather than row by row.
2) To reduce page splitting of the point_log_idx1 index you can try to use ORDER BY in the INSERT statement. It still doesn't guarantee the physical order on disk, but it does guarantee the order in which the point_log_id IDENTITY values would be generated, and hopefully it will act as a hint to process the source data in this order. If the source data is processed in the requested order, then the b-tree structure of the point_log_idx1 index may grow without unnecessary, costly page splits.
I'm using SQL Server 2008. I have a system that collects a lot of monitoring data in a central database 24/7. Originally I was inserting data as it arrived, row by row. Then I realized that each insert was a separate transaction and that the system spent most of its time writing to the transaction log.

Eventually I moved to inserting data in batches using a stored procedure that accepts a table-valued parameter. In my case a batch is a few hundred to a few thousand rows. In my system I keep data only for a given number of days, so I regularly delete obsolete data. To keep the system performance stable I rebuild my indexes regularly as well.
In your example, it may look like the following.

First, create a table type:
CREATE TYPE [dbo].[PointValuesTableType] AS TABLE(
point_id int,
timestamp datetime,
value int
)
Then the procedure would look like this:
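CREATE PROCEDURE [dbo].[InsertPointValues]
    -- Rows to insert, passed as a table-valued parameter
    @ParamRows dbo.PointValuesTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY
        -- Insert the whole batch in one statement, in index order
        INSERT INTO dbo.point_log
            (point_id
            ,timestamp
            ,value)
        SELECT
            TT.point_id
            ,TT.timestamp
            ,TT.value
        FROM @ParamRows AS TT
        ORDER BY TT.point_id, TT.timestamp;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
        -- Re-raise so the caller learns that the batch failed
        -- (RAISERROR, because SQL Server 2008 has no THROW).
        DECLARE @ErrorMessage nvarchar(2048);
        SET @ErrorMessage = ERROR_MESSAGE();
        RAISERROR('%s', 16, 1, @ErrorMessage);
    END CATCH;
END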
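For completeness, a minimal sketch of calling the procedure from T-SQL. The sample rows are made up for illustration; in a real system the client application would fill the table-valued parameter with one batch of samples for a single point_id.

```sql
-- Table variable of the type defined above; holds one batch.
DECLARE @rows dbo.PointValuesTableType;

-- Hypothetical samples for point_id = 1, already in timestamp order.
INSERT INTO @rows (point_id, [timestamp], value)
VALUES (1, '2015-01-01T00:00:00', 10),
       (1, '2015-01-01T00:01:00', 12),
       (1, '2015-01-01T00:02:00', 11);

-- One call, one transaction, one multi-row INSERT into point_log.
EXEC dbo.InsertPointValues @ParamRows = @rows;
```

The point of the design is that the whole batch is written in a single transaction, so the transaction log is flushed once per batch rather than once per row.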
In practice you should measure for your system which is more efficient: with ORDER BY, or without. You really need to consider the performance of the INSERT operation as well as the performance of subsequent queries.
It may be that faster inserts lead to higher fragmentation of the index, which leads to slower queries.
So, you should check the fragmentation of the index after an INSERT both with ORDER BY and without it. You can use sys.dm_db_index_physical_stats to get index stats.
Returns size and fragmentation information for the data and indexes of the specified table or view in SQL Server.
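A minimal sketch of such a check, assuming the table lives in the current database under the dbo schema. The 'LIMITED' scan mode is chosen here only because it is the cheapest; it is not something the original answer specifies.

```sql
-- Resolve the ids first so the function receives plain variables.
DECLARE @db_id  int;
DECLARE @obj_id int;
SET @db_id  = DB_ID();
SET @obj_id = OBJECT_ID(N'dbo.point_log');

-- One row per index level scanned; join to sys.indexes for the names.
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(@db_id, @obj_id, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
ORDER BY i.name;
```

Running this after a representative batch inserted with ORDER BY, and again after one inserted without it, gives a direct comparison of avg_fragmentation_in_percent for point_log_idx1.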