
B-trees, databases, sequential vs. random inserts, and speed. Random is winning

EDIT

@Remus corrected my test pattern. You can see the corrected version in his answer below.

I took the suggestion of replacing the INT with DECIMAL(29,0) and the results were:

Decimal: 2133
GUID: 1836

Random inserts are still winning, even with a fractionally bigger row.
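For reference, a minimal sketch of what that variant looks like (only the key type of the int table in the test code below changes; the T1d name here is made up):

-- Hypothetical DECIMAL(29,0) variant of T1. A DECIMAL(29,0) is stored
-- in 17 bytes, one byte more than a 16-byte uniqueidentifier.
CREATE TABLE [dbo].[T1d](
    [ID] [decimal](29,0) NOT NULL,
    [Pad] [char](300) NULL,
 CONSTRAINT [T1d_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
-- The insert loop stays the same; the int counter converts implicitly:
--     insert into T1d values (@i, @c)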

Despite explanations that indicate random inserts are slower than sequential ones, these benchmarks show them to be apparently faster. The explanations I'm getting don't agree with the benchmarks. Therefore, my question remains focused on b-trees, sequential inserts, and speed.

...

I know from experience that b-trees have awful performance when data is added to them sequentially (regardless of the direction). However, when data is added randomly, the best performance is obtained.

This is easy to demonstrate with the likes of an RB-tree: sequential writes cause the maximum number of tree rebalances to be performed.

I know that few databases use binary trees; they use n-order balanced trees instead. I logically assume these suffer a similar fate to binary trees when it comes to sequential inserts.

This sparked my curiosity.

If this is so, then one could deduce that writing sequential IDs (such as with IDENTITY(1,1)) would cause multiple re-balances of the tree to occur. I have seen many posts argue against GUIDs because "these will cause random writes". I never use GUIDs, but it struck me that this "bad" point was in fact a good point.

So I decided to test it. Here is my code:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[T1](
    [ID] [int] NOT NULL,
    [Pad] [char](300) NULL, -- padding column to hold @c; the column name is illustrative
 CONSTRAINT [T1_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO

CREATE TABLE [dbo].[T2](
    [ID] [uniqueidentifier] NOT NULL,
    [Pad] [char](300) NULL, -- padding column to hold @c; the column name is illustrative
 CONSTRAINT [T2_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO

declare @i int, @t1 datetime, @t2 datetime, @t3 datetime, @c char(300)

set @c = REPLICATE('x', 300) -- fill the padding value so each row carries the extra 300 bytes
set @t1 = GETDATE()
set @i = 1

while @i < 2000 begin
    insert into T2 values (NEWID(), @c)
    set @i = @i + 1
end

set @t2 = GETDATE()
WAITFOR delay '0:0:10'
set @t3 = GETDATE()
set @i = 1

while @i < 2000 begin
    insert into T1 values (@i, @c)
    set @i = @i + 1
end

select DATEDIFF(ms, @t1, @t2) AS [GUID], DATEDIFF(ms, @t3, getdate()) AS [Int] -- @t1..@t2 timed the NEWID() loop, @t3 onward the int loop

drop table T1
drop table T2

Note that I am not subtracting any time for the creation of the GUID, nor for the considerably larger row size. The results on my machine were as follows:

Int: 17,340 ms
GUID: 6,746 ms

This means that in this test, random inserts of 16 bytes were almost 3 times faster than sequential inserts of 4 bytes.

Would anyone like to comment on this?

PS: I get that this isn't a question. It's an invitation to discussion, and that is relevant to learning optimal programming.

Flip the operation and the int is faster. Have you taken into account log and data file growth? Run each separately:

-- assumes the single-column (ID only) variants of T1 and T2
declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 values (NEWID())
    set @i = @i + 1
END


set @t2 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 values (@i)
    set @i = @i + 1
end



select DATEDIFF(ms, @t1, @t2) AS [UID], DATEDIFF(ms, @t2, getdate()) AS [Int]

The problem with UUIDs, when clustering on them and not using NEWSEQUENTIALID(), is that they cause page splits and fragmentation of the table.
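One way to see this for yourself (a sketch; sys.dm_db_index_physical_stats is the standard DMV in SQL Server 2005 and later, and the table names match the test above):

-- After the NEWID() inserts, the clustered index on T2 should report
-- much higher fragmentation than the sequentially filled T1.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       ips.index_type_desc,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.T2'), NULL, NULL, 'LIMITED') AS ips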

Now try it like this and you will see it is almost the same:

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 values (NEWID())
    set @i = @i + 1
END
select DATEDIFF(ms, @t1, getdate()) 

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 values (@i)
    set @i = @i + 1
end



select DATEDIFF(ms, @t1, getdate())

And reversed:

declare @i int, @t1 datetime, @t2 datetime



set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T1 values (@i)
    set @i = @i + 1
end
select DATEDIFF(ms, @t1, getdate()) -- report the int loop before the timer is reset

set @t1 = GETDATE()
set @i = 1

while @i < 10000 begin
    insert into T2 values (NEWID())
    set @i = @i + 1
END
select DATEDIFF(ms, @t1, getdate())

You are not measuring the INSERT speed. You are measuring your log flush performance. Since you commit after each INSERT, all these tests are doing is sitting around waiting for the commit to harden the log. That is hardly relevant for INSERT performance. And please don't post 'performance' measurements when SET NOCOUNT is OFF...

So let's try this without unnecessary server-client chatter, with properly sized data, batched commits, and a pre-grown database:

:setvar dbname testdb
:setvar testsize 1000000
:setvar batchsize 1000

use master;
go

if db_id('$(dbname)') is not null
begin
    drop database [$(dbname)];
end
go

create database [$(dbname)] 
    on (name='test_data', filename='c:\temp\test_data.mdf', size=10gb)
    log on (name='test_log', filename='c:\temp\test_log.ldf', size=100mb);
go

use [$(dbname)];
go  

CREATE TABLE [dbo].[T1](
    [ID] [int] NOT NULL
 CONSTRAINT [T1_1] PRIMARY KEY CLUSTERED ([ID] ASC) 
)
GO

CREATE TABLE [dbo].[T2](
    [ID] [uniqueidentifier] NOT NULL
 CONSTRAINT [T2_1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
GO

set nocount on;
go

declare @i int, @t1 datetime, @t2 datetime

set @t1 = GETDATE()
set @i = 1

begin transaction;
while @i < $(testsize) begin
    insert into T1 values (@i)
    set @i = @i + 1
    if @i % $(batchsize) = 0
    begin
        commit;
        begin transaction;
    end
end
commit

set @t2 = GETDATE()
set @i = 1
begin transaction
while @i < $(testsize) begin
    insert into T2 values (NEWID())
    set @i = @i + 1
    if @i % $(batchsize) = 0
    begin
        commit;
        begin transaction;
    end
end
commit

select DATEDIFF(ms, @t1, @t2) AS [Int], DATEDIFF(ms, @t2, getdate()) AS [UID]

drop table T1
drop table T2

INTS: 18s
GUIDS: 23s

QED

I expect that in a real database, rebalancing of an index is a minor problem, because lots of index entries fit in a single block, and as long as they do, inserts rarely force a split or rebalance. (With an 8 KB page and a 4-byte int key, a leaf page holds on the order of several hundred entries.)

What might become more of an issue is contention for the single block containing all the new entries. Oracle has a feature to store the bytes of the key in reverse order, to spread new entries out over all blocks: http://oracletoday.blogspot.com/2006/09/there-is-option-to-create-index.html I don't know about other databases.
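For what it's worth, a minimal sketch of that Oracle feature (a reverse key index; the table and index names here are made up):

-- Oracle syntax: the REVERSE keyword stores the key bytes reversed,
-- so sequential values land on different leaf blocks instead of one hot block.
CREATE INDEX orders_id_rev ON orders (order_id) REVERSE;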
