Redshift UPDATE prohibitively slow

I have a table in a Redshift cluster with ~1 billion rows. I have a job that tries to update some column values based on a filter. Updating anything at all in this table is incredibly slow. Here's an example:

SELECT col1, col2, col3
FROM SOMETABLE
WHERE col1 = 'a value of col1'
  AND col2 = 12;

The above query returns in less than a second, because I have sortkeys on col1 and col2. Only one row meets these criteria, so the result set is just one row. However, if I run:

UPDATE SOMETABLE
SET col3 = 20
WHERE col1 = 'a value of col1'
  AND col2 = 12;

This query takes an unknown amount of time (I stopped it after 20 minutes). Again, it should be updating one column value of one row.

I have also tried to follow the documentation here: http://docs.aws.amazon.com/redshift/latest/dg/merge-specify-a-column-list.html , which talks about creating a temporary staging table to update the main table, but got the same results.

Any idea what is going on here?

You didn't mention what percentage of the table you're updating, but it's important to note that an UPDATE in Redshift is a two-step process:

  1. Each row that will be changed must first be marked for deletion
  2. Then a new version of the data must be written for each column in the table

If you have a large number of columns and/or are updating a large number of rows, then this process can be very expensive for the database.
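One way to see the effect of the "mark for deletion" step is to compare the stored row count with the visible row count. The following is a minimal sketch, assuming the table is named sometable as in the question and that the SVV_TABLE_INFO system view exposes the tbl_rows, estimated_visible_rows and unsorted columns as I recall them. Check the column names against your cluster before relying on it.

```sql
-- tbl_rows counts rows that are marked for deletion but not yet vacuumed,
-- so a large gap to estimated_visible_rows suggests many dead rows
-- left behind by earlier UPDATE or DELETE statements.
SELECT "table",
       tbl_rows,
       estimated_visible_rows,
       unsorted            -- percent of rows in the unsorted region
FROM svv_table_info
WHERE "table" = 'sometable';

-- Reclaim the space held by rows marked for deletion.
-- VACUUM cannot run inside a transaction block.
VACUUM DELETE ONLY sometable;
```

Whether this explains the slowness in the question is not certain. It only shows how the deleted-row backlog described above could be inspected.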

You could experiment with using a CREATE TABLE AS statement to create a new "updated" version of the table and then dropping the existing table and renaming the new table. This has the added benefit of leaving you with a fully sorted table.
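A rough sketch of that approach, using the filter from the question. The names sometable_new and sometable_old are hypothetical, and the column list assumes the table has only col1, col2 and col3, so a real table would need every column listed. As far as I know, CREATE TABLE AS does not carry over constraints or column defaults, so those would have to be re-created separately.

```sql
-- Build the "updated" copy, applying the change while copying.
-- SORTKEY is restated because the new table does not inherit it automatically.
CREATE TABLE sometable_new
SORTKEY (col1, col2)
AS
SELECT col1,
       col2,
       CASE
           WHEN col1 = 'a value of col1' AND col2 = 12 THEN 20
           ELSE col3
       END AS col3
FROM sometable;

-- Swap the tables, keeping the old one until the new one is verified.
ALTER TABLE sometable RENAME TO sometable_old;
ALTER TABLE sometable_new RENAME TO sometable;

-- Drop the old copy once the swap looks correct.
DROP TABLE sometable_old;
```

Rewriting a billion rows to change one value is heavy, so this makes more sense when many updates can be folded into the same CASE expressions in a single pass.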

Actually, I don't think Redshift is designed for bulk updates. Redshift is designed for OLAP rather than OLTP, so update operations on it are inefficient by nature.

In this use case, I would suggest doing an INSERT instead of an UPDATE and adding a TIMESTAMP column. When you run analysis on Redshift, you'll then need extra logic to pick the row with the latest TIMESTAMP and eliminate possible duplicate entries.
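A minimal sketch of this insert-then-deduplicate pattern. It assumes a hypothetical updated_at column and that (col1, col2) identifies one logical row, which matches the single-row result described in the question but is not stated as a key there.

```sql
-- One-time change: add the timestamp column (hypothetical name).
-- Existing rows will have NULL in it.
ALTER TABLE sometable ADD COLUMN updated_at TIMESTAMP;

-- Instead of UPDATE, append a new version of the row.
INSERT INTO sometable (col1, col2, col3, updated_at)
VALUES ('a value of col1', 12, 20, GETDATE());

-- At query time, keep only the newest version per (col1, col2).
-- NULLS LAST makes pre-existing rows (NULL updated_at) lose to newer inserts.
SELECT col1, col2, col3
FROM (
    SELECT col1,
           col2,
           col3,
           ROW_NUMBER() OVER (
               PARTITION BY col1, col2
               ORDER BY updated_at DESC NULLS LAST
           ) AS rn
    FROM sometable
) latest
WHERE rn = 1;
```

The trade-off is that every analytical query has to go through this deduplication step, or through a view that wraps it.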
