
Slow loading into ultra-wide tables on Redshift

I have a few ultra-wide tables (1500+ columns) which I am trying to load data into. I am loading GZIPped files from S3 using a manifest file. The DISTKEY of the table is 'date', and each file in S3 contains information for one particular date only. The columns are mostly FLOATs, with a few DATEs and VARCHARs.

Each file has approximately 16000 rows with 1500 columns, and is approximately 84 MiB gzipped. Even following best practices for loading, we are seeing very poor load performance: 100 records/s or approximately 300 kB/s.

Are there any suggestions for improving load speeds specifically for ultra-wide tables? I'm loading data into narrower tables using similar techniques with fairly reasonable speeds, so I have reason to believe that this is an artifact of the width of the table.
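For reference, the load is issued with a COPY roughly like the one below; the table name, bucket path, and IAM role are placeholders:

    COPY wide_metrics
    FROM 's3://my-bucket/manifests/2017-01-01.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
    MANIFEST                 -- the FROM path points at a manifest file, not a data file
    GZIP                     -- source files are gzip-compressed
    FORMAT AS JSON 'auto';   -- one JSON object per row, keys matched to column names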

Having files separated by the DISTKEY field does not necessarily improve load speed. Amazon Redshift will use multiple nodes to import files in parallel, and the node that reads a particular input file will not necessarily be the same node used to store the data. Therefore, data will be sent between nodes, which is expected during a load process.

If the table has been newly created, then the load process will automatically use the first 100,000 rows to determine an optimal compression type for each column. It will then delete that data and restart the load process. To avoid this, either create the table with compression defined on each column or run the COPY command with the COMPUPDATE option set to OFF. If, on the other hand, there is already data in the table, then this automatic process will be skipped.
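As a sketch (hypothetical table and column names), compression can be declared up front, or the analysis can be skipped at COPY time:

    -- Define column encodings at table creation so COPY does not need to sample rows.
    CREATE TABLE wide_metrics (
        metric_date DATE ENCODE az64,
        label       VARCHAR(64) ENCODE zstd,
        metric_0001 DOUBLE PRECISION ENCODE zstd,
        metric_0002 DOUBLE PRECISION ENCODE zstd
        -- ... remaining float columns
    )
    DISTKEY (metric_date);

    -- Or skip the automatic compression analysis on an empty table:
    COPY wide_metrics
    FROM 's3://my-bucket/manifests/2017-01-01.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
    MANIFEST GZIP FORMAT AS JSON 'auto'
    COMPUPDATE OFF;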

It is possible that the load process is consuming too much memory and spilling to disk. Try increasing wlm_query_slot_count to increase the memory available to the COPY command. However, I'm not sure that this parameter applies to COPY commands (it is described for 'queries', and a COPY might not qualify as a query).
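If you want to try it, the slot count is a session-level setting; something along these lines (the slot count of 3 is just an illustrative value, names are placeholders as above):

    -- Claim extra WLM slots (and their memory) for this session before loading.
    SET wlm_query_slot_count TO 3;

    COPY wide_metrics
    FROM 's3://my-bucket/manifests/2017-01-01.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
    MANIFEST GZIP FORMAT AS JSON 'auto';

    -- Return to the default of one slot per query.
    SET wlm_query_slot_count TO 1;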

Adding for future reference: One optimization that helped was switching from Gzipped JSON to CSV files. This reduced each file from 84 MiB to 11 MiB and tripled the loading speed.
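In COPY terms, that change amounts to swapping the JSON option for CSV (names below are placeholders as above):

    COPY wide_metrics
    FROM 's3://my-bucket/manifests/2017-01-01.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
    MANIFEST
    GZIP
    CSV;   -- column order in the files must match the table definition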
