
Which one is more performant in Redshift: TRUNCATE followed by INSERT INTO, or DROP and CREATE TABLE AS?

Original post: 2021-01-07 05:49:20 | Tags: amazon-web-services, amazon-redshift

I have been working with AWS Redshift and am curious which data loading (full reload) method is more performant.

Approach 1 (using TRUNCATE):

Truncate the existing table
Load the data using an INSERT INTO ... SELECT statement

Approach 2 (using DROP and CREATE):

Drop the existing table
Load the data using a CREATE TABLE AS SELECT statement
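
The two approaches can be written out as statement sequences. The following is a minimal Python sketch that only builds the SQL text and does not connect to anything. The function names and the table names `sales_fact` and `sales_staging` are hypothetical, and `SELECT *` assumes the source and target columns line up.

```python
def truncate_reload(target: str, source: str) -> list[str]:
    """Approach 1: keep the table definition, replace only the rows."""
    # Identifiers are interpolated directly, so pass trusted names only.
    return [
        f"TRUNCATE TABLE {target};",
        f"INSERT INTO {target} SELECT * FROM {source};",
    ]


def drop_recreate(target: str, source: str) -> list[str]:
    """Approach 2: discard the table and rebuild it from the query."""
    return [
        f"DROP TABLE {target};",
        f"CREATE TABLE {target} AS SELECT * FROM {source};",
    ]


if __name__ == "__main__":
    for stmt in truncate_reload("sales_fact", "sales_staging"):
        print(stmt)
    for stmt in drop_recreate("sales_fact", "sales_staging"):
        print(stmt)
```

One difference worth keeping in mind when comparing the two: a table built with CREATE TABLE AS does not inherit constraints, identity columns, or default column values from the table it replaces, whereas TRUNCATE leaves the existing definition untouched.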

We have been using both in our ETL, but I am interested in understanding what happens behind the scenes on the AWS side.

In my opinion, DROP followed by CREATE TABLE AS should be more performant, because it avoids the overhead of scanning and handling the table's existing data blocks that the INSERT INTO statement incurs. Moreover, TRUNCATE in AWS Redshift does not reseed identity columns (see "Redshift Truncate table and reset Identity?").

Please share your thoughts.

1 answer

Redshift operates on 1MB blocks as the base unit of storage and coherency. When changes are made to a table, it is these blocks that are "published" for all to see when the changes are committed. A table is just a list (a data structure) of the block IDs that compose it, and there can be many versions of a table in flight at any time (if it is being changed while others are viewing it).

For the sake of this question, let's assume that the table in question is large (contains a lot of data), which I expect is true. Both approaches end up doing a common action: unlinking and freeing all the blocks in the table. The blocks are where all the data lives, so you'd think the two would take the same time, and on idle systems they are close. Both automatically commit their results, so the command doesn't complete until the work is done. In this idle-system comparison I've seen DROP run faster, but then you need to CREATE the table again, so time is needed to rebuild the table's data structure; that step can sit in a transaction block, which raises the question of whether the COMMIT should be counted too. The bottom line is that on an idle system the two approaches are quite close in runtime, and when I last measured them for a client the DROP approach was a bit faster. I would advise you to read on before making your decision.
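
If you want to repeat this kind of measurement on your own cluster, a small timing harness is enough. The sketch below is an assumption about how one might do it, not the method used for the numbers above: `time_statements` is a hypothetical helper, and `execute` stands for whatever callable runs one statement (for example a DB-API cursor's `execute` on a connection that commits each statement).

```python
import time
from typing import Callable, Iterable


def time_statements(execute: Callable[[str], None], statements: Iterable[str]) -> float:
    """Run each statement through `execute` and return total elapsed seconds."""
    start = time.perf_counter()
    for stmt in statements:
        execute(stmt)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in executor that only records statements; swap in a real cursor's execute.
    seen: list[str] = []
    elapsed = time_statements(
        seen.append,
        ["TRUNCATE TABLE t;", "INSERT INTO t SELECT * FROM s;"],
    )
    print(f"{len(seen)} statements in {elapsed:.6f}s")
```

Running each approach several times, on both an idle and a busy cluster, gives a fairer picture than a single run.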

However, in the real world Redshift clusters are rarely idle, and under load these two statements can behave quite differently. DROP requires exclusive control over the table since it does not run inside of a transaction block. All other uses of the table must be closed (committed or rolled back) before DROP can execute. So if you perform this DROP/recreate procedure on a table that others are using, the DROP statement will be blocked until all of those uses complete. This can take an indeterminate amount of time. For ETL processing on "hidden" or "unpublished" tables the DROP/recreate method can work, but you need to be really careful about which other sessions are accessing the table in question.
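
One way to see what a DROP might be waiting on is to list the open transactions that hold or await locks on the table. The sketch below only builds the query. It assumes the `SVV_TRANSACTIONS` system view and the `pg_class` catalog are readable by your user, and that your driver uses `%s` placeholders; `open_locks_query` and the table name are hypothetical. It matches on table name only and ignores the schema, so check the column list and join against the Redshift documentation before relying on it.

```python
def open_locks_query(table_name: str) -> tuple[str, tuple[str, ...]]:
    """Build a query listing transactions that hold or await locks on a table."""
    sql = (
        "SELECT t.pid, t.txn_owner, t.txn_start, t.lock_mode, t.granted "
        "FROM svv_transactions t "
        "JOIN pg_class c ON c.oid = t.relation "
        "WHERE c.relname = %s "  # assumption: table name alone identifies the table
        "ORDER BY t.txn_start;"
    )
    return sql, (table_name,)


if __name__ == "__main__":
    sql, params = open_locks_query("sales_fact")
    print(sql)
    print(params)
```

An empty result just before the DROP suggests nothing is currently holding the table, although a new session can still open it between the check and the DROP.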

TRUNCATE does run inside of a transaction but performs a commit upon completion. This means that it won't be blocked by others working with the table. There is simply one version of the table that is full (for those who were looking at it before TRUNCATE ran) and one version that is completely empty. The table's data structure has a version for each session that has it open, and each session sees the blocks (or lack of blocks) that correspond to its version. I suspect that managing these data structures and propagating the changes through the commit queue is the bookkeeping that slows TRUNCATE down slightly. The upside of this bookkeeping is that TRUNCATE will not be blocked by other sessions reading the table.
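
The idea of one full version and one empty version can be pictured with a toy model. This is purely illustrative and says nothing about how Redshift is actually implemented: `VersionedTable` and the block IDs are made up, and a "snapshot" here is just an index into a list of versions.

```python
class VersionedTable:
    """Toy model: a table is a list of block IDs, one list per committed version."""

    def __init__(self, block_ids: list[int]) -> None:
        self._versions: list[tuple[int, ...]] = [tuple(block_ids)]

    def open_snapshot(self) -> int:
        """A reader pins the latest committed version and gets its number back."""
        return len(self._versions) - 1

    def read(self, snapshot: int) -> tuple[int, ...]:
        """Return the block IDs visible to a reader holding `snapshot`."""
        return self._versions[snapshot]

    def truncate(self) -> None:
        """Publish a new, empty version; older versions stay readable."""
        self._versions.append(())


if __name__ == "__main__":
    table = VersionedTable([101, 102, 103])
    before = table.open_snapshot()  # reader that opened the table before TRUNCATE
    table.truncate()
    after = table.open_snapshot()   # reader that opened it afterwards
    print(table.read(before))  # (101, 102, 103)
    print(table.read(after))   # ()
```

The earlier reader keeps seeing its three blocks while the later reader sees none, which is the behavior described above for sessions on either side of a TRUNCATE.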

The deciding factor in choosing between these approaches is often not performance; it is which one has the locking and coherency behavior that will work in your solution.