
Snowflake - Parallel processing and copy of large zip file to a SF Table

We have a process to load data from a CSV into a Snowflake table. The input file is gzip-compressed and is around 70 to 80 GB after unzipping. At present the process reads the gzip file and inserts it directly into the staging table. With a medium cluster it takes around 3 to 3.5 hours to complete. We need to understand whether any parallelism can be applied here for faster processing.

CREATE OR REPLACE FILE FORMAT MANAGEMENT.TEST_GZIP_FORMAT
  TYPE = CSV
  FIELD_DELIMITER = ';'
  SKIP_HEADER = 2
  ESCAPE_UNENCLOSED_FIELD = NONE
  TRIM_SPACE = TRUE;


INSERT INTO TEST_DB.TEMP_TABLE (emp_id, emp_name)
SELECT DISTINCT
       temp.$1 AS emp_id,
       temp.$2 AS emp_name
FROM /Azureserverlocation/test/apps/
     (file_format => MANAGEMENT.TEST_GZIP_FORMAT, pattern => './test_file.gz') temp;

Can you break your process into two stages and use COPY INTO?

  1. Use COPY INTO to copy the data into a staging table.
COPY INTO TEST_DB.TEMP_TABLE_STG
FROM '/Azureserverlocation/test/apps/'
PATTERN = '.*test_file.gz'
FILE_FORMAT = (FORMAT_NAME = 'MANAGEMENT.TEST_GZIP_FORMAT');
  2. Then select the DISTINCT rows from the staging table.
CREATE TABLE TEST_DB.TEMP_TABLE AS
SELECT DISTINCT emp_id, emp_name
FROM TEST_DB.TEMP_TABLE_STG;
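
One further note on the parallelism question: gzip is not a splittable format, so a single 70-80 GB .gz file is read by one load thread regardless of warehouse size. COPY INTO parallelizes across files, so if the source can be split upstream into many smaller gzip chunks (Snowflake's general guidance is roughly 100-250 MB compressed per file), the same COPY in step 1 will fan out across the warehouse. A minimal sketch, assuming the chunks are uploaded to the same location and named test_file_000.gz, test_file_001.gz, and so on (hypothetical names):

-- Assumes the large file was split into numbered gzip chunks before upload.
COPY INTO TEST_DB.TEMP_TABLE_STG
FROM '/Azureserverlocation/test/apps/'
PATTERN = '.*test_file_[0-9]+[.]gz'   -- one regex matches every chunk so they load in parallel
FILE_FORMAT = (FORMAT_NAME = 'MANAGEMENT.TEST_GZIP_FORMAT');

The DISTINCT in step 2 is unchanged; it runs once over the staging table after all chunks have loaded.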
