We have a process that loads data from a CSV file into a Snowflake table. The input file is gzip-compressed and expands to roughly 70 to 80 GB. Currently the process reads the gzip file and inserts directly into the staging table, and on a Medium warehouse it takes about 3 to 3.5 hours to complete. I need to understand whether any parallelism can be applied here for faster processing.
CREATE OR REPLACE FILE FORMAT MANAGEMENT.TEST_GZIP_FORMAT
  TYPE = CSV
  FIELD_DELIMITER = ';'
  SKIP_HEADER = 2
  ESCAPE_UNENCLOSED_FIELD = NONE
  TRIM_SPACE = TRUE;
INSERT INTO TEST_DB.TEMP_TABLE (emp_id, emp_name)
SELECT DISTINCT
    temp.$1 AS emp_id,
    temp.$2 AS emp_name
FROM '/Azureserverlocation/test/apps/'
    (file_format => MANAGEMENT.TEST_GZIP_FORMAT, pattern => './test_file.gz') temp;
You can break your process into two stages and use COPY INTO.

1. Use COPY INTO to load the data into a staging table. Note that PATTERN is a copy option (a regular expression), not part of FILE_FORMAT, and FORMAT_NAME takes a quoted string:

COPY INTO TEST_DB.TEMP_TABLE_STG
FROM '/Azureserverlocation/test/apps/'
FILE_FORMAT = (FORMAT_NAME = 'MANAGEMENT.TEST_GZIP_FORMAT')
PATTERN = '.*test_file.gz';
2. Then select the distinct rows from the staging table:

CREATE TABLE TEST_DB.TEMP_TABLE AS
SELECT DISTINCT emp_id, emp_name FROM TEST_DB.TEMP_TABLE_STG;
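On the parallelism question itself: COPY INTO parallelizes across files, not within a single file, so one 70 to 80 GB gzip is loaded by a single thread regardless of warehouse size. The usual remedy is to split the source into many smaller compressed chunks (on the order of 100 to 250 MB each) before staging them. A minimal sketch, with illustrative file names and a toy chunk size (the real job would use something like a million lines per chunk):

```shell
# Create a small sample gzip as a stand-in for the real 70-80 GB file.
printf 'header1\nheader2\na;1\nb;2\nc;3\nd;4\n' | gzip > test_file.gz

# Drop the 2 header lines up front (so SKIP_HEADER is not needed per chunk),
# then split into fixed-size chunks and gzip each one (GNU split --filter).
gunzip -c test_file.gz \
  | tail -n +3 \
  | split -d -l 2 --filter='gzip > $FILE.csv.gz' - part_

ls part_*.csv.gz   # part_00.csv.gz part_01.csv.gz
```

The chunks can then be staged and loaded with a pattern such as PATTERN = '.*part_.*[.]csv[.]gz' and a file format with SKIP_HEADER = 0, letting the warehouse load all chunks in parallel.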