
Update column in Amazon Redshift with join for big tables

I have a table with 500M rows and 30 columns (including a bigint ID column); let's call it big_one. I also have another table, extra_one, with the same number of rows and the same ID column, plus two columns of extra data that I'd like to include in the first table. I added two extra columns to the first table and want to fill them using a join.

The query is quite simple:

update big_one set
    col1=extra_one.col1,
    col2=extra_one.col2
from extra_one
where big_one.id=extra_one.id;

But during execution, disk space usage climbed dramatically, up to 100%. Before starting, 23.41% of the disk space was in use across 4 nodes (160GB each, 640GB total), of which the big_one table accounted for about 18%. That means I had about 490GB of free disk space, which should have been enough to perform the update smoothly. But Redshift thinks differently.

[Image: disk space usage]

The two new columns are MD5 hashes (32 characters each), so ideally they should take up to 16GB of space.

Recap:

  1. I have a wide table, big_one.

  2. I have another table, extra_one (3 columns in total), with the same IDs and the same number of records.

  3. I added two new columns to big_one.

  4. I want to enrich big_one with data from extra_one (into those two new columns).

Q1: Any advice on how to perform such big updates?

Q2: If I create a VIEW that joins the two tables and use that instead, will it save me from this kind of disk space drain? How does Redshift handle VIEWs (not materialized) in such cases?

Do not use UPDATE on a large number of rows.

When a row is modified in Amazon Redshift, the existing row is marked as deleted and a new row is appended to the table. Updating every row therefore effectively doubles the size of the table and wastes a lot of disk space until the table is vacuumed. It is also very slow!
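If a large UPDATE has already run, the space held by the deleted row versions can be reclaimed with a vacuum. The following is a minimal sketch, assuming the table name big_one from the question and that you can read the SVV_TABLE_INFO system view; it only cleans up after the fact and does not avoid the temporary doubling during the update itself.

```sql
-- Check the current size of the table (size is reported in 1 MB blocks).
SELECT "table", size, tbl_rows
FROM svv_table_info
WHERE "table" = 'big_one';

-- Reclaim the space occupied by rows that were marked as deleted,
-- without re-sorting the table.
VACUUM DELETE ONLY big_one;
```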

Instead:

  • Create a query that JOINs the two tables
  • Use the query to populate a new table (see below)
  • Delete the old table and rename the new table so that it replaces the original table (or, truncate the original table and copy the data back into it)

You can use CREATE TABLE LIKE to create a new, empty table based on an existing table.

From CREATE TABLE - Amazon Redshift:

LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ]
A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. The new table and the parent table are decoupled, and any changes made to the parent table aren't applied to the new table. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. The default behavior is to exclude default expressions, so that all columns of the new table have null defaults.

Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys, BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE ... LIKE statement.
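Put together, the steps above could look roughly like the sketch below. It assumes big_one already has the two new (still empty) columns col1 and col2, as described in the question; big_one_new and big_one_old are hypothetical table names, and col_a and col_b are placeholders standing in for the table's other existing columns. A LEFT JOIN is used so that rows without a match in extra_one are kept with NULLs, which mirrors what the UPDATE would have done.

```sql
-- 1. Create an empty copy of big_one; LIKE carries over the column
--    definitions, distribution style and sort keys.
CREATE TABLE big_one_new (LIKE big_one);

-- 2. Populate the new table from the join instead of updating in place.
--    col_a, col_b are placeholders: list all existing columns here.
INSERT INTO big_one_new (id, col_a, col_b, col1, col2)
SELECT b.id, b.col_a, b.col_b, e.col1, e.col2
FROM big_one AS b
LEFT JOIN extra_one AS e ON e.id = b.id;

-- 3. Swap the tables, then drop the old one once the new one is verified.
ALTER TABLE big_one RENAME TO big_one_old;
ALTER TABLE big_one_new RENAME TO big_one;
DROP TABLE big_one_old;
```

Until the old table is dropped, both copies exist side by side, so the cluster needs enough free space for roughly one extra copy of big_one plus the two new columns.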


 