How to update 2 new columns created in a table which has more than 250 million rows

I have to add 2 new columns, col1 char(1) NULL and col2 char(1) NULL, to a table which has more than 250 million rows. And I have to update the two columns with the value 1 for the existing 250 million rows.

Then my SSIS package will update the table on a daily basis, incrementally. The SSIS package will populate these two columns with whatever comes from the source table.

How can I accomplish this quickly, given that I have to update 250M rows?

Thanks.

You didn't say which version of SQL Server you're using. Starting with SQL Server 2012, adding a new NOT NULL column with a default is, in most cases, instantaneous: only the table metadata is changed, and no rows are updated. Thanks to Martin Smith for this information. So on that version, you'd be better off dropping and recreating the columns.
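A minimal sketch of that approach, assuming SQL Server 2012+ and that the two NULL columns were already added but not yet populated (the DF_ constraint names are just illustrative):

-- Drop the empty NULL columns that were already added.
ALTER TABLE dbo.YourTable DROP COLUMN Col1, Col2;

-- On 2012+ this is a metadata-only change: every existing row reads back '1'
-- without any row actually being rewritten.
ALTER TABLE dbo.YourTable ADD
   Col1 char(1) NOT NULL CONSTRAINT DF_YourTable_Col1 DEFAULT ('1'),
   Col2 char(1) NOT NULL CONSTRAINT DF_YourTable_Col2 DEFAULT ('1');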

In prior versions, you could try something like this:

WHILE 1 = 1 BEGIN
   -- Update the next batch of 10,000 rows that have not been processed yet.
   WITH T AS (
      SELECT TOP (10000) *
      FROM dbo.YourTable
      WHERE
         Col1 IS NULL
         AND Col2 IS NULL
   )
   UPDATE T
   SET
      T.Col1 = '1',
      T.Col2 = '1'
   ;
   IF @@RowCount < 10000 BREAK; -- a trick to save one iteration most times
END;

This could take a long time to run, but it has the benefit that it will not hold a lock on the table for very long. The exact combination of indexes and the typical row size will also affect how well it performs. The sweet spot for the number of rows to update per batch is never constant; it could be 50,000, or 2,000. I have experimented with different counts in chunked operations like this in the past and found that 5,000 or 10,000 is usually pretty close to the optimum size.

Depending on the version of SQL Server (2008 and up), the above query could also benefit from a filtered index:

CREATE UNIQUE NONCLUSTERED INDEX IX_YourTable ON dbo.YourTable (ClusteredColumns)
   WHERE Col1 IS NULL AND Col2 IS NULL;

When you are done, drop the index.
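For instance, using the index name from the example above:

DROP INDEX IX_YourTable ON dbo.YourTable;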

Note that if you had specified your two new columns with defaults and NOT NULL, they would have had the values added during column creation, after which the default could then be dropped:

ALTER TABLE dbo.YourTable ADD Col1 char(1)
   NOT NULL CONSTRAINT DF_YourTable_Col1 DEFAULT ('1');
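Once the column exists, the default constraint can be removed if it is no longer wanted; a sketch, to be repeated for Col2:

-- Drop only the default; the column and its existing values stay in place.
ALTER TABLE dbo.YourTable DROP CONSTRAINT DF_YourTable_Col1;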

Unlike adding NULL columns to the end, which can be done lickety-split, this could have taken a significant amount of time on versions before 2012, so on your 250M-row table it may not have been an option.

UPDATE: To address Bryan's comment:

  1. The rationale for doing it in small batches of 10,000 is that the negative effects of the update's "overhead" are largely ameliorated. Yes, indeed, it will be a LOT of activity, but it won't block for very long, and that is the #1 performance-harming effect of an activity like this: blocking for a long period.

  2. We have a lot of knowledge of the locking potential of this query: the UPDATE takes exclusive locks on the rows it changes, and the prior point should keep any harmful effects from this to a minimum. Please share if there are additional locking concerns that I'm missing.

  3. The filtered index helps because it allows reading only a few pages of the index, followed by a seek into the giant table. True, because of the update the filtered index will have to be maintained to remove the updated rows, since they no longer qualify, and this does increase the cost of the write portion of the update. That sounds bad until you realize that the biggest part of the batched UPDATE above, without some kind of index, would be a table scan each time. Given 250M rows, that is roughly 25,000 batches, each scanning on average around half the table before finding 10,000 unprocessed rows, which requires the same resources as about 12,500 complete scans of the entire table! So my suggestion to use the index DOES work, and it is a nice and easy shortcut alternative to walking the clustered index manually.

  4. The "basic laws of indexes" that they are bad for tables which have lots of write actions doesn't hold here. 它们对具有大量写操作的表不利的“索引基本定律”在这里不成立。 You are thinking of normal OLTP access patterns where the row being updated can be found with a seek, and then for a write, every additional index on the table will indeed create overhead that did not before exist. 您正在考虑正常的OLTP访问模式,其中可以使用搜索找到正在更新的行,然后对于写入,表上的每个附加索引确实会创建之前不存在的开销。 Compare this to the explanation in my previous point. 将此与我上一点的解释进行比较。 Even if the filtered index makes the UPDATE part take 5 times as much I/O per row (doubtful), that will still be a reduction in I/O of over 2,500 times!!! 即使过滤后的索引使UPDATE部分占用每行I / O的5倍(可疑),这仍然会使I / O减少超过2500次! .

Evaluating the performance impact of the update is important, especially if the table is incredibly busy and constantly being used. If needed, scheduling it during off hours (if such exist) is, just as you suggested, basic sense.

One potential weak point in my suggestion is that in SQL 2008 and below, adding the filtered index could take a long time (though maybe not, since it is a VERY narrow index and will be written in clustered order, probably with a single scan!). So if it does take too long to create, there is an alternative: walk the clustered index manually. That might look like this:

DECLARE @ClusteredID int = 0; --assume clustered index is a single int column
DECLARE @Updated TABLE (
   ClusteredID int NOT NULL
);

WHILE 1 = 1 BEGIN
   WITH T AS (
      SELECT TOP (10000) *
      FROM dbo.YourTable
      WHERE ClusteredID > @ClusteredID -- the "walking" part
      ORDER BY ClusteredID -- also crucial for "walking"
   )
   UPDATE T
   SET
      T.Col1 = '1',
      T.Col2 = '1'
   OUTPUT Inserted.ClusteredID INTO @Updated
   ;

   IF @@RowCount = 0 BREAK;

   SELECT @ClusteredID = Max(ClusteredID)
   FROM @Updated
   ;

   DELETE @Updated;
END;

There you go: no index, seeks all the way, and only one effective scan of the entire table (with a tiny bit of overhead from dealing with the table variable). If the ClusteredID column is densely packed, you can probably even dispense with the table variable and just add 10,000 manually at the end of each loop.
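A rough sketch of that "densely packed" variant, under the assumption that ClusteredID is an int with essentially no gaps and the table is not empty (the fixed 10,000-wide ranges are illustrative):

DECLARE @ClusteredID int = 0;
DECLARE @MaxID int = (SELECT MAX(ClusteredID) FROM dbo.YourTable);

WHILE @ClusteredID < @MaxID BEGIN
   -- Each pass covers one fixed 10,000-wide key range via a clustered index seek.
   UPDATE dbo.YourTable
   SET
      Col1 = '1',
      Col2 = '1'
   WHERE ClusteredID >  @ClusteredID
     AND ClusteredID <= @ClusteredID + 10000;

   SET @ClusteredID += 10000;
END;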

You provided an update that you have 5 columns in your clustered index. Here's an updated script showing how you might accommodate that:

DECLARE -- Five random data types seeded with guaranteed low values
   @Clustered1 int = 0,
   @Clustered2 int = 0,
   @Clustered3 varchar(10) = '',
   @Clustered4 datetime = '19000101',
   @Clustered5 int = 0
;

DECLARE @Updated TABLE (
   Clustered1 int,
   Clustered2 int,
   Clustered3 varchar(10),
   Clustered4 datetime,
   Clustered5 int
);

WHILE 1 = 1 BEGIN
   WITH T AS (
      SELECT TOP (10000) *
      FROM dbo.YourTable
      WHERE
         Clustered1 > @Clustered1
         OR (
            Clustered1 = @Clustered1
            AND (
               Clustered2 > @Clustered2
               OR (
                  Clustered2 = @Clustered2
                  AND (
                     Clustered3 > @Clustered3
                     OR (
                        Clustered3 = @Clustered3
                        AND (
                           Clustered4 > @Clustered4
                           OR (
                              Clustered4 = @Clustered4
                              AND Clustered5 > @Clustered5
                           )
                        )
                     )
                  )
               )
            )
         )
      ORDER BY
         Clustered1, -- also crucial for "walking"
         Clustered2,
         Clustered3,
         Clustered4,
         Clustered5
   )
   UPDATE T
   SET
      T.Col1 = '1',
      T.Col2 = '1'
   OUTPUT
      Inserted.Clustered1,
      Inserted.Clustered2,
      Inserted.Clustered3,
      Inserted.Clustered4,
      Inserted.Clustered5
   INTO @Updated
   ;

   IF @@RowCount < 10000 BREAK;

   SELECT TOP (1) -- descending order: pick up the last key processed in this batch
     @Clustered1 = Clustered1,
     @Clustered2 = Clustered2,
     @Clustered3 = Clustered3,
     @Clustered4 = Clustered4,
     @Clustered5 = Clustered5
   FROM @Updated
   ORDER BY
      Clustered1 DESC,
      Clustered2 DESC,
      Clustered3 DESC,
      Clustered4 DESC,
      Clustered5 DESC
   ;

   DELETE @Updated;
END;

If you find that one particular way of doing it doesn't work, try another. Understanding the database system at a deeper level will lead to better ideas and superior solutions. I know the deeply-nested WHERE condition is a doozy. You could also try the following on for size; it works exactly the same but is much harder to understand, so I can't really recommend it, even though adding additional columns to it is very easy.

WITH T AS (
   SELECT TOP (10000) *
   FROM
      dbo.YourTable T
   WHERE
      122 <=
         CASE WHEN Clustered1 > @Clustered1 THEN 172 WHEN Clustered1 = @Clustered1 THEN 81 ELSE 0 END
         + CASE WHEN Clustered2 > @Clustered2 THEN 54 WHEN Clustered2 = @Clustered2 THEN 27 ELSE 0 END
         + CASE WHEN Clustered3 > @Clustered3 THEN 18 WHEN Clustered3 = @Clustered3 THEN 9 ELSE 0 END
         + CASE WHEN Clustered4 > @Clustered4 THEN 6 WHEN Clustered4 = @Clustered4 THEN 3 ELSE 0 END
         + CASE WHEN Clustered5 > @Clustered5 THEN 2 WHEN Clustered5 = @Clustered5 THEN 1 ELSE 0 END
   ORDER BY
      Clustered1, -- also crucial for "walking"
      Clustered2,
      Clustered3,
      Clustered4,
      Clustered5
)
UPDATE T
SET
   T.Col1 = '1',
   T.Col2 = '1'
OUTPUT
   Inserted.Clustered1,
   Inserted.Clustered2,
   Inserted.Clustered3,
   Inserted.Clustered4,
   Inserted.Clustered5
INTO @Updated
;

I have performed updates on gigantic tables many times with this exact "walk the clustered index in small batches" strategy, with no ill effect on the production database.
