Passing a table to a stored procedure

I have a table with 20 billion rows. The table does not have any indexes because it was created on the fly for a bulk insert operation. It is used in a stored procedure that performs the following operations:

Delete A
from master a 
inner join (Select distinct Col from TableB ) b
on A.Col = B.Col

Insert into master 
Select *
from tableB
group by col1,col2,col3

TableB is the table with 20 billion rows. I don't want to execute the SP directly because it might take days to complete. Master is also a huge table and has a clustered index on Col.

  1. Can I pass chunks of rows to the stored procedure and perform the operation chunk by chunk? This might reduce the log file growth. If yes, how can I do that?
  2. Should I create a clustered index on the table and then execute the SP? That might be a little faster, but creating a CI on a table this large might itself take 10 hours.

Or is there any other way to perform this operation quickly?

I've used a method similar to this one. If you can, I'd recommend putting your DB into the Bulk Logged recovery model instead of the Full recovery model.
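As a rough sketch of that recommendation, the recovery model can be switched for the duration of the load and then switched back. The database name `MyDatabase` is a placeholder, not something from the question, and whether you are allowed to change the recovery model depends on your backup strategy.

```sql
-- MyDatabase is a hypothetical name; substitute your own database.

-- Check the current recovery model first.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'MyDatabase';

-- Switch to bulk-logged so qualifying bulk operations are minimally logged.
ALTER DATABASE [MyDatabase] SET RECOVERY BULK_LOGGED;

-- ... run the batched delete/insert here ...

-- Switch back once the load is finished.
ALTER DATABASE [MyDatabase] SET RECOVERY FULL;
```

Taking a log backup right after switching back to Full is a sensible precaution, since point-in-time restore is not possible within a log backup that contains minimally logged operations.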

The blog entry is reproduced below to future-proof this answer.

Below is a technique used to transfer a large number of records from one table to another. This scales pretty well for a couple of reasons. First, it will not fill up the entire log before committing the transaction. Rather, it populates the table in chunks of 10,000 records, each committed separately. Second, it's generally much quicker. You will have to play around with the batch size. Sometimes it's more efficient at 10,000, sometimes at 500,000, depending on the system.

If you do not need to insert into an existing table and just need a copy of the table, it is better to do a SELECT INTO. For this example, however, we are inserting into an existing table.

Another trick is to change the recovery model of the database to simple. This way, there will be much less logging in the transaction log.

The WITH (TABLOCK) hint below only gives minimally logged inserts in SQL Server 2008 and later.

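DECLARE @BatchSize INT = 10000

WHILE 1 = 1
BEGIN
    INSERT INTO [dbo].[Destination] --WITH (TABLOCK) -- Uncomment for 2008
    (
        -- PersonID is copied so the NOT EXISTS check below can skip rows
        -- that were already transferred (assumes Destination.PersonID is
        -- not an IDENTITY column).
        PersonID
        ,FirstName
        ,LastName
        ,EmailAddress
        ,PhoneNumber
    )
    SELECT TOP(@BatchSize)
        s.PersonID
        ,s.FirstName
        ,s.LastName
        ,s.EmailAddress
        ,s.PhoneNumber
    FROM [dbo].[SOURCE] s
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.Destination d
        WHERE d.PersonID = s.PersonID
    )

    -- A short batch means the source has been exhausted.
    IF @@ROWCOUNT < @BatchSize BREAK
END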

With the above example, it is important to have at least a nonclustered index on PersonID in both tables. Without one, every batch has to scan both tables to evaluate the NOT EXISTS check.

Another way to transfer records is to use multiple threads, each one handling its own range of records, like so:

-- Session 1: first key range
INSERT INTO [dbo].[Destination]
(
    FirstName
    ,LastName
    ,EmailAddress
    ,PhoneNumber
)
SELECT
    s.FirstName
    ,s.LastName
    ,s.EmailAddress
    ,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 1 AND 5000
GO

-- Session 2: next key range (run from a separate connection)
INSERT INTO [dbo].[Destination]
(
    FirstName
    ,LastName
    ,EmailAddress
    ,PhoneNumber
)
SELECT
    s.FirstName
    ,s.LastName
    ,s.EmailAddress
    ,s.PhoneNumber
FROM [dbo].[SOURCE] s
WHERE PersonID BETWEEN 5001 AND 10000

For super fast performance, however, I'd recommend using SSIS, especially in SQL Server 2008. We recently transferred 17 million records in 5 minutes with an SSIS package executed on the same server as the two databases it transferred between.

SQL Server 2008

SQL Server 2008 changed its logging mechanism for inserts. Previously, to do an insert that was minimally logged, you had to perform a SELECT .. INTO. Now, you can perform a minimally logged insert if you can lock the table you are inserting into. The WITH (TABLOCK) hint in the batch example above illustrates this.

The exception to this rule is a table that has a clustered index AND is not empty. If the table is empty, you acquire a table lock, and it has a clustered index, the insert will be minimally logged. If the table already contains data, however, the insert will be fully logged. If you have a nonclustered index on a heap and you acquire a table lock, then only the nonclustered index will be logged. It is always better to drop indexes prior to inserting records.

To determine the amount of logging, you can use the following statement:

  SELECT * FROM ::fn_dblog(NULL, NULL) 

Credit for the above goes to Derek Dieter at SQL Server Planet.
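The raw fn_dblog output is hard to read on a busy database, so a summary per operation type can make it easier to compare two batch sizes or two recovery models. This is only a sketch: fn_dblog is undocumented, and the column names used here (`Operation`, `[Log Record Length]`) are the ones it commonly returns, so treat them as an assumption for your version.

```sql
-- Summarise the active transaction log by operation type.
-- Run in the database being loaded, before and after a test batch.
SELECT
    Operation
    ,COUNT(*) AS LogRecordCount
    ,SUM([Log Record Length]) AS TotalLogBytes
FROM fn_dblog(NULL, NULL)
GROUP BY Operation
ORDER BY TotalLogBytes DESC;
```

If the insert is minimally logged, the byte total for row-insert operations should be far smaller relative to the number of rows loaded.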

If you're dead set on passing a table to your stored procedure, you can pass a table-valued parameter to a stored procedure in SQL Server 2008. You might have better luck with some of the other approaches suggested, like partitioning. The SELECT DISTINCT on a table with 20 billion rows might be part of the problem. I wonder if some very basic tuning wouldn't help, too:

Delete A
from master a 
where exists (select 1 from TableB b where b.Col = a.Col)
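For the table-valued parameter route, a minimal sketch might look like the following. The type name `dbo.ColList`, the procedure name `dbo.DeleteMasterByCol`, and the assumption that Col is an INT are all made up for illustration; only `master`, `TableB`, and `Col` come from the question.

```sql
-- Hypothetical table type holding one chunk of key values.
-- Assumes Col is an INT; adjust to the real data type.
CREATE TYPE dbo.ColList AS TABLE
(
    Col INT NOT NULL PRIMARY KEY
);
GO

-- Hypothetical procedure: deletes the master rows matching one chunk.
-- Table-valued parameters must be declared READONLY.
CREATE PROCEDURE dbo.DeleteMasterByCol
    @Cols dbo.ColList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    DELETE a
    FROM master AS a
    WHERE EXISTS (SELECT 1 FROM @Cols AS c WHERE c.Col = a.Col);
END
GO

-- Usage: load one chunk of distinct keys and pass it to the procedure.
DECLARE @Chunk dbo.ColList;

INSERT INTO @Chunk (Col)
SELECT DISTINCT TOP (10000) Col
FROM TableB
WHERE Col IS NOT NULL
ORDER BY Col;

EXEC dbo.DeleteMasterByCol @Cols = @Chunk;
```

Each call commits on its own, so the log can be reused or backed up between chunks. Without an index on TableB.Col, though, building every chunk still scans TableB, so this does not remove the need to think about indexing or partitioning.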
