
Redshift unload to S3 is extremely slow

I'm using a ds2.xlarge Redshift cluster in US West with about 1TB of data. I'm trying to UNLOAD a 50GB table to an S3 bucket in the same region as follows:

UNLOAD ('select * from table_name') TO 's3://bucket/folder_name/'
CREDENTIALS 'aws_access_key_id=foo;aws_secret_access_key=bar'
MANIFEST;

This query takes about 1 hour to run. That's surprising, since Amazon advertises roughly 0.5 GB/s of I/O for this cluster; at that rate the 50GB table should take less than 2 minutes to upload to S3, not an hour (20-30x slower than advertised).

Has anyone else run into this issue and/or found a fix / workaround? If we decide to use Redshift, we will need to move about 200GB of data from Redshift to S3 every day.

It's very expensive for Redshift to "re-materialize" complete rows. That's why the S3 unload is much slower than the cluster's headline disk I/O would suggest.

The data is stored on disk in a manner that's optimised for retrieving a single column. Recreating the full rows generates (effectively) random I/O. Your unload would be much faster on an SSD-based node type.

If you want to verify this, you can write all of the columns (delimited) into a staging table with a single VARCHAR(MAX) column, which will be quite slow, and then unload that table, which will be much faster; a sketch of that experiment follows below.
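For illustration, a minimal sketch of that experiment might look like the following. The staging table name (table_name_flat), the column names, and the '|' delimiter are placeholders, not part of the original answer; in practice non-character columns need explicit casts and NULLs need COALESCE.

-- Hypothetical staging table: one wide VARCHAR column holding the pre-delimited row text.
CREATE TABLE table_name_flat (row_text VARCHAR(MAX));

-- Re-materialize the rows once, inside Redshift (this is the slow step).
INSERT INTO table_name_flat
SELECT col_a || '|' || col_b || '|' || CAST(col_c AS VARCHAR)
FROM table_name;

-- Unloading the single-column table is then mostly sequential I/O (the fast step).
UNLOAD ('select row_text from table_name_flat') TO 's3://bucket/folder_name_flat/'
CREDENTIALS 'aws_access_key_id=foo;aws_secret_access_key=bar'
MANIFEST;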
