Snowflake - Parallel processing and copy of large zip file to a SF Table
We have a process that loads data from a CSV file into a Snowflake table. The input file is gzip-compressed and is around 70 to 80 GB once unzipped. At present the process reads the gzip file and inserts directly into the staging table, and on a Medium warehouse it takes around 3 to 3.5 hours to complete. I need to understand whether any parallelism can be applied here for faster processing.
CREATE OR REPLACE FILE FORMAT MANAGEMENT.TEST_GZIP_FORMAT
  TYPE = CSV FIELD_DELIMITER = ';' SKIP_HEADER = 2
  ESCAPE_UNENCLOSED_FIELD = NONE TRIM_SPACE = TRUE;

INSERT INTO TEST_DB.TEMP_TABLE (emp_id, emp_name)
SELECT DISTINCT temp.$1 AS emp_id, temp.$2 AS emp_name
FROM /Azureserverlocation/test/apps/
  (file_format => MANAGEMENT.TEST_GZIP_FORMAT, pattern => './test_file.gz') temp;
Can you break your process into 2 stages? First use COPY INTO to copy the data into a stage table. Note that PATTERN is a copy option of its own rather than part of the file format:

COPY INTO TEST_DB.TEMP_TABLE_STG
FROM '/Azureserverlocation/test/apps/'
PATTERN = './test_file.gz'
FILE_FORMAT = (FORMAT_NAME = 'MANAGEMENT.TEST_GZIP_FORMAT');
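
On the parallelism side: Snowflake parallelises COPY INTO across files rather than within a single file, and a gzip archive cannot be split, so one 70-80 GB file is effectively loaded by a single thread. If you can split the source into many smaller compressed files (Snowflake's guidance is roughly 100-250 MB compressed per file) in the same Azure location, a single COPY INTO with a pattern will load them in parallel across the warehouse's load threads. A minimal sketch, assuming hypothetical chunk names like test_file_part_001.gz:

-- Load all split chunks in one COPY so the warehouse can process
-- several files concurrently (one file per load thread).
COPY INTO TEST_DB.TEMP_TABLE_STG
FROM '/Azureserverlocation/test/apps/'
PATTERN = '.*test_file_part_.*[.]gz'   -- assumed naming of the split chunks
FILE_FORMAT = (FORMAT_NAME = 'MANAGEMENT.TEST_GZIP_FORMAT');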
Then get a DISTINCT from the stage table.

CREATE TABLE TEST_DB.TEMP_TABLE AS
SELECT DISTINCT emp_id, emp_name FROM TEST_DB.TEMP_TABLE_STG;
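
If you want to confirm how the load was spread across files, the INFORMATION_SCHEMA.COPY_HISTORY table function reports per-file row counts and load times. A minimal sketch, assuming the session is using TEST_DB and the staging table above:

-- Per-file load statistics for the last 4 hours (adjust the window as needed).
SELECT file_name, row_count, last_load_time
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'TEMP_TABLE_STG',
    START_TIME => DATEADD(hour, -4, CURRENT_TIMESTAMP())));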