
Fastest way to update 120 Million records

I need to initialize a new field with the value -1 in a 120 Million record table.

Update table
       set int_field = -1;

I let it run for 5 hours before canceling it.

I tried running it with the transaction level set to read uncommitted, with the same results.

Recovery Model = Simple.
MS SQL Server 2005

Any advice on getting this done faster?

The only sane way to update a table of 120M records is with a SELECT statement that populates a second table. You have to take care when doing this. Instructions below.


Simple Case

For a table without a clustered index, during a time without concurrent DML:

  • SELECT *, new_col = 1 INTO clone.BaseTable FROM dbo.BaseTable
  • recreate indexes, constraints, etc. on the new table
  • switch old and new with ALTER SCHEMA ... TRANSFER
  • drop the old table

If you can't create a clone schema, a different table name in the same schema will do. Remember to rename all your constraints and triggers (if applicable) after the switch.
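
A minimal sketch of the simple case. The schema names here are assumptions, and backup_old is just a parking spot for the old table so both can't collide in dbo:

-- copy every row, adding the new column on the way
CREATE SCHEMA clone AUTHORIZATION dbo
GO
SELECT *, int_field = CAST(-1 AS int)
INTO clone.BaseTable
FROM dbo.BaseTable
GO
-- recreate indexes, constraints, etc. on clone.BaseTable here,
-- then swap the tables via schema transfer
CREATE SCHEMA backup_old AUTHORIZATION dbo
GO
ALTER SCHEMA backup_old TRANSFER dbo.BaseTable
ALTER SCHEMA dbo TRANSFER clone.BaseTable
GO
DROP TABLE backup_old.BaseTable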


Non-simple Case

First, recreate your BaseTable with the same name under a different schema, e.g. clone.BaseTable. Using a separate schema will simplify the rename process later.

  • Include the clustered index, if applicable. Remember that primary keys and unique constraints may be clustered, but not necessarily so.
  • Include identity columns and computed columns, if applicable.
  • Include your new INT column, wherever it belongs.
  • Do not include any of the following:
    • triggers
    • foreign key constraints
    • non-clustered indexes/primary keys/unique constraints
    • check constraints or default constraints. Defaults don't make much of a difference, but we're trying to keep things minimal.
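
As a concrete (hypothetical) example, a stripped-down clone might look like this; the column definitions are assumptions chosen to match the test insert below, with Col1 as an identity primary key:

CREATE TABLE clone.BaseTable
(
    Col1 int IDENTITY(1,1) NOT NULL,   -- identity preserved from dbo.BaseTable
    Col2 varchar(50) NOT NULL,
    Col3 int NOT NULL,                 -- the new INT column
    CONSTRAINT PK_BaseTable PRIMARY KEY CLUSTERED (Col1)  -- clustered index only
)

Constraint names only need to be unique within a schema, so PK_BaseTable can coexist in both dbo and clone; that is part of why the separate schema simplifies the rename later.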

Then, test your insert with 1000 rows:

-- assuming an IDENTITY column in BaseTable
SET IDENTITY_INSERT clone.BaseTable ON
GO
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT TOP 1000 Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF

Examine the results. If everything appears in order:

  • truncate the clone table
  • make sure the database is in the bulk-logged or simple recovery model
  • perform the full insert (a sketch follows this list).
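
A sketch of that final step, reusing the names from the test insert above (querying sys.databases first if you are unsure of the recovery model):

TRUNCATE TABLE clone.BaseTable
GO
-- confirm the recovery model before the big insert
SELECT recovery_model_desc FROM sys.databases WHERE name = DB_NAME()
GO
SET IDENTITY_INSERT clone.BaseTable ON
GO
-- same as the test insert, minus the TOP clause
INSERT clone.BaseTable WITH (TABLOCK) (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 = -1
FROM dbo.BaseTable
GO
SET IDENTITY_INSERT clone.BaseTable OFF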

This will take a while, but not nearly as long as an update. Once it completes, check the data in the clone table to make sure everything is correct.

Then, recreate all non-clustered primary keys/unique constraints/indexes and foreign key constraints (in that order). Recreate default and check constraints, if applicable. Recreate all triggers. Recreate each constraint, index or trigger in a separate batch. e.g.:

ALTER TABLE clone.BaseTable ADD CONSTRAINT UQ_BaseTable UNIQUE (Col2)
GO
-- next constraint/index/trigger definition here

Finally, move dbo.BaseTable to a backup schema and clone.BaseTable to the dbo schema (or wherever your table is supposed to live).

-- -- perform first true-up operation here, if necessary
-- EXEC clone.BaseTable_TrueUp
-- GO
-- -- create a backup schema, if necessary
-- CREATE SCHEMA backup_20100914
-- GO
BEGIN TRY
  BEGIN TRANSACTION
  ALTER SCHEMA backup_20100914 TRANSFER dbo.BaseTable
  -- -- perform second true-up operation here, if necessary
  -- EXEC clone.BaseTable_TrueUp
  ALTER SCHEMA dbo TRANSFER clone.BaseTable
  COMMIT TRANSACTION
END TRY
BEGIN CATCH
  SELECT ERROR_MESSAGE() -- add more info here if necessary
  ROLLBACK TRANSACTION
END CATCH
GO

If you need to free up disk space, you may drop your original table at this time, though it may be prudent to keep it around a while longer.

Needless to say, this is ideally an offline operation. If you have people modifying data while you perform this operation, you will have to perform a true-up operation with the schema switch. I recommend creating a trigger on dbo.BaseTable to log all DML to a separate table. Enable this trigger before you start the insert. Then, in the same transaction in which you perform the schema transfer, use the log table to perform a true-up. Test this first on a subset of the data! Deltas are easy to screw up.
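
A hypothetical sketch of such a logging trigger, assuming Col1 is the table's key (the delta table, trigger, and column names here are assumptions; a true-up procedure would then replay these rows against clone.BaseTable):

CREATE TABLE clone.BaseTable_Delta
(
    Col1      int      NOT NULL,                   -- the base table's key
    Operation char(1)  NOT NULL,                   -- 'I', 'U' or 'D'
    LoggedAt  datetime NOT NULL DEFAULT GETDATE()
)
GO
CREATE TRIGGER dbo.BaseTable_LogDML
ON dbo.BaseTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON
    -- inserts: rows in inserted only
    INSERT clone.BaseTable_Delta (Col1, Operation)
    SELECT i.Col1, 'I' FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM deleted d WHERE d.Col1 = i.Col1)
    -- updates: rows in both inserted and deleted
    INSERT clone.BaseTable_Delta (Col1, Operation)
    SELECT i.Col1, 'U' FROM inserted i
    WHERE EXISTS (SELECT 1 FROM deleted d WHERE d.Col1 = i.Col1)
    -- deletes: rows in deleted only
    INSERT clone.BaseTable_Delta (Col1, Operation)
    SELECT d.Col1, 'D' FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted i WHERE i.Col1 = d.Col1)
END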

If you have the disk space, you could use SELECT INTO and create a new table. It's minimally logged, so it would go much faster.

select t.*, int_field = CAST(-1 as int)
into mytable_new 
from mytable t

-- create your indexes and constraints

GO

exec sp_rename mytable, mytable_old
exec sp_rename mytable_new, mytable

drop table mytable_old

I break the task up into smaller units. Test with different batch-size intervals for your table until you find an interval that performs optimally. Here is a sample that I have used in the past.

declare @counter int
declare @numOfRecords int
declare @batchsize int

set @numOfRecords = (SELECT COUNT(*) FROM <TABLE> with(nolock))
set @counter = 0
set @batchsize = 2500

set rowcount @batchsize
while @counter < (@numOfRecords/@batchsize) + 1
begin
    set @counter = @counter + 1
    -- SET ROWCOUNT caps each update at @batchsize rows;
    -- the WHERE clause lets each pass pick up only untouched rows
    update <TABLE> set int_field = -1 where int_field <> -1;
end
set rowcount 0

If your int_field is indexed, remove the index before running the update. Then create your index again...
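
For instance (the index name here is an assumption):

-- hypothetical index on the column being mass-updated
DROP INDEX IX_mytable_int_field ON mytable
GO
UPDATE mytable SET int_field = -1
GO
CREATE INDEX IX_mytable_int_field ON mytable (int_field)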

5 hours seems like a lot for 120 million records.

declare @cnt bigint
set @cnt = 1

while @cnt * 100 < 10000000
begin

    UPDATE top(100) [Imp].[dbo].[tablename]
    SET [col1] = xxxx
    WHERE [col2] is null

    print '@cnt: ' + convert(varchar, @cnt)
    set @cnt = @cnt + 1
end
set rowcount 1000000
update table set int_field = -1 where int_field <> -1

See how long that takes; adjust and repeat as necessary.

What I'd try first is to drop all constraints, indexes, triggers and full-text indexes before you update.

If the above isn't performant enough, my next move would be to create a CSV file with the 120 million records and bulk import it using bcp.
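
A sketch of that round trip from the command line; the server, database, file and table names are assumptions, and queryout lets you append the -1 column during the export:

rem export the data, adding the new column on the way out
bcp "SELECT *, -1 FROM MyDb.dbo.mytable" queryout data.csv -S myserver -T -c -t,
rem bulk load into the new table; -b sets the batch size, -h "TABLOCK" takes a table lock
bcp MyDb.dbo.mytable_new in data.csv -S myserver -T -c -t, -b 50000 -h "TABLOCK"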

Lastly, I'd create a new heap table (meaning a table with no clustered index) with no indexes on a different filegroup and populate it with -1. Then partition the old table, and add the new partition using "switch".
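
The switch itself is a metadata-only operation and near-instant, though it has strict preconditions (matching column definitions and constraints, and the staging table must sit on the target partition's filegroup). The bare syntax, with hypothetical names:

-- move the populated heap in as partition 2 of the (already partitioned) table
ALTER TABLE dbo.mytable_staging SWITCH TO dbo.mytable PARTITION 2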

When adding a new column ("initialize a new field") and setting a single value on every existing row, I use the following tactic:

-- adding the column NOT NULL with a default backfills -1 into every existing row
ALTER TABLE MyTable
 add NewColumn  int  not null
  constraint MyTable_TemporaryDefault
   default -1

-- the default has done its job; drop it so it doesn't linger on the table
ALTER TABLE MyTable
 drop constraint MyTable_TemporaryDefault

If the column is nullable and you don't include a "declared" constraint, the column will be set to NULL for all rows.

Sounds like an indexing problem, like Pablo Santa Cruz mentioned. Since your update is not conditional, you can DROP the column and RE-ADD it with a DEFAULT value.
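
That would look much like the previous answer's snippet, just with a drop first (the constraint name is an assumption):

ALTER TABLE mytable DROP COLUMN int_field
GO
-- re-adding it NOT NULL with a default backfills -1 into every row
ALTER TABLE mytable
 ADD int_field int NOT NULL
 CONSTRAINT DF_mytable_int_field DEFAULT -1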

In general, the recommendations are:

  1. Remove or just disable all INDEXES, TRIGGERS, CONSTRAINTS on the table;
  2. Perform COMMIT more often (e.g. after each 1000 records updated);
  3. Use SELECT ... INTO.

But in your particular case you should choose the most appropriate solution, or a combination of them.

Also bear in mind that sometimes an index could be useful, e.g. when you perform an update of a non-indexed column by some condition.

If the table has an index which you can iterate over, I would put an update top(10000) statement in a while loop moving over the data. That would keep the transaction log slim and won't have such a huge impact on the disk system. Also, I would recommend playing with the maxdop option (setting it closer to 1).
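
One simple shape of that loop, using the batch size and MAXDOP setting suggested above (if the new column is nullable and currently NULL, test for IS NULL instead of <> -1):

DECLARE @rows int
SET @rows = 1

WHILE @rows > 0
BEGIN
    -- each pass updates at most 10000 untouched rows
    UPDATE TOP (10000) mytable
    SET int_field = -1
    WHERE int_field <> -1
    OPTION (MAXDOP 1)

    SET @rows = @@ROWCOUNT
END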
