We have a process that loads data from a CSV file into a Snowflake table. The input file is gzip-compressed and expands to roughly 70 to 80 GB. Currently the process reads the gzip file and inserts directly into the staging table, and on a Medium warehouse it takes about 3 to 3.5 hours to complete. I need to understand whether any parallelism can be applied here for faster processing.
CREATE OR REPLACE FILE FORMAT MANAGEMENT.TEST_GZIP_FORMAT
  TYPE = CSV
  FIELD_DELIMITER = ';'
  SKIP_HEADER = 2
  ESCAPE_UNENCLOSED_FIELD = NONE
  TRIM_SPACE = TRUE;
INSERT INTO TEST_DB.TEMP_TABLE (emp_id, emp_name)
SELECT DISTINCT
    temp.$1 AS emp_id,
    temp.$2 AS emp_name
FROM '/Azureserverlocation/test/apps/'
    (file_format => MANAGEMENT.TEST_GZIP_FORMAT, pattern => './test_file.gz') temp;
You can break your process into two stages and use COPY INTO.

1. Use COPY INTO to load the data into a staging table. Note that PATTERN is a copy option (a regular expression), not part of FILE_FORMAT, and FORMAT_NAME takes a quoted string:

COPY INTO TEST_DB.TEMP_TABLE_STG
FROM '/Azureserverlocation/test/apps/'
FILE_FORMAT = (FORMAT_NAME = 'MANAGEMENT.TEST_GZIP_FORMAT')
PATTERN = '.*test_file.gz';
2. Then select the distinct rows from the staging table:

CREATE TABLE TEST_DB.TEMP_TABLE AS
SELECT DISTINCT emp_id, emp_name FROM TEST_DB.TEMP_TABLE_STG;
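On the parallelism question itself: COPY INTO parallelizes across files, not within a single file, so one 70 to 80 GB gzip is loaded by a single thread regardless of warehouse size. The usual remedy is to split the source into many smaller compressed chunks (on the order of 100 to 250 MB each) before staging them. A minimal sketch, with illustrative file names and a toy chunk size (the real job would use something like a million lines per chunk):

```shell
# Create a small sample gzip as a stand-in for the real 70-80 GB file.
printf 'header1\nheader2\na;1\nb;2\nc;3\nd;4\n' | gzip > test_file.gz

# Drop the 2 header lines up front (so SKIP_HEADER is not needed per chunk),
# then split into fixed-size chunks and gzip each one (GNU split --filter).
gunzip -c test_file.gz \
  | tail -n +3 \
  | split -d -l 2 --filter='gzip > $FILE.csv.gz' - part_

ls part_*.csv.gz   # part_00.csv.gz part_01.csv.gz
```

The chunks can then be staged and loaded with a pattern such as PATTERN = '.*part_.*[.]csv[.]gz' and a file format with SKIP_HEADER = 0, letting the warehouse load all chunks in parallel.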