Snowflake - Unable to copy split JSON files from a stage into a table

Question

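I'm working with Snowflake and some JSON files that need to be uploaded to a stage. Since Snowflake does not allow files larger than 1 GB, I had to split them into smaller files using 7zip.

As you can see in the attached image, the files have been uploaded to the stage.

I'm trying to copy these files from the stage into a table with the following command: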

copy into yelp_user from @staging/yelp_academic_dataset_user.json.gz  file_format
                                  =(format_name=yelp_user) on_error='skip_file';

This gives me the following error:

002019 (0A000): SQL compilation error:JSON file format can produce one and only one column of type variant or object or array. Use CSV file format if you want to load more than one column.

Then I tried creating a JSON table:

CREATE OR REPLACE TABLE json_table_user(json_data variant);

copy into JSON_TABLE_USER  file_format =(format_name = 'yelp_user') files=('yelp_academic_dataset_user.json.001.gz','yelp_academic_dataset_user.json.002.gz','yelp_academic_dataset_user.json.003.gz','yelp_academic_dataset_user.json.004.gz') on_error = 'skip_file';

and I get this error:

Remote file 'https://gcpuscentral1-ncxq405-stage.storage.googleapis.com/tables/2807681033/yelp_academic_dataset_user.json.004.gz' was not found. There are several potential causes. The file might not exist. The required credentials may be missing or invalid. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.

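This is driving me crazy, because the tutorials on the Snowflake website haven't helped.

Does anyone know how to copy these split files into a table the way I need?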

Answer 1

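I'm not sure what went wrong in your process, but I found a similar file to reproduce it.

In this case, instead of uploading to a Snowflake internal stage, I told Snowflake to read the file from a GCS bucket. To do that, I first created a storage integration: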

use role accountadmin;

CREATE STORAGE INTEGRATION fh_gcp
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = GCS
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://fhoffa-snow/')  
;
describe integration fh_gcp; 
-- give access to the snowflake gcp account to the bucket in gcp
grant usage on integration fh_gcp to role sysadmin;
--

use role sysadmin;

create stage fh_gcp_stage
url = 'gcs://fhoffa-snow/'
storage_integration = fh_gcp;

list @fh_gcp_stage; -- check files exist

Then I slightly modified your SQL to read from this stage. Note that I did not need to split the file; Snowflake happily reads files larger than 1 GB:

create temp table json_table_user(json_data variant);

copy into JSON_TABLE_USER
from @fh_gcp_stage
file_format = (type=json)
files=('202104/yelp_academic_dataset_user.json.gz') on_error = 'skip_file'
;

Then you can start having fun with queries and semi-structured data:

select median(json_data:average_stars) stars
    , median(json_data:review_count) reviews
    , median(json_data:funny) funny
    , median(json_data:useful) useful
    , count(*) c
from json_table_user ;
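
Values pulled out with the `:` path syntax come back as VARIANT, so it can help to cast them to a concrete type before sorting or filtering. A small sketch that reuses only the fields from the query above; the threshold of 100 is arbitrary and chosen purely for illustration:

```sql
-- Sketch: cast VARIANT paths to concrete types before comparing or ordering.
select json_data:review_count::int    as reviews
     , json_data:average_stars::float as stars
from json_table_user
where json_data:review_count::int >= 100   -- illustrative threshold, not from the original post
order by reviews desc
limit 10;
```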

A quick fix for what you tried above might be to read the existing file into the single-column VARIANT table you created:

copy into json_table_user from @staging/yelp_academic_dataset_user.json.gz  file_format
                                  =(format_name=yelp_user) on_error='skip_file';
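
One more observation, offered as a guess rather than a diagnosis: the URL in the second error contains `/tables/2807681033/`, which looks like the table's own stage rather than `@staging`, and that COPY statement has no FROM clause. Snowflake documents that COPY INTO reads from the table stage when FROM is omitted. A minimal sketch of loading the split parts from the named stage instead, assuming the parts really are in `@staging` under the names used in the question (the regular expressions are my own and may need adjusting):

```sql
-- Check which files the stage actually holds (names are taken from the question).
list @staging pattern = '.*yelp_academic_dataset_user.*';

-- Load every split part in one statement; FROM names the stage explicitly.
copy into json_table_user
from @staging
pattern = '.*yelp_academic_dataset_user[.]json[.][0-9]+[.]gz'
file_format = (type = json)
on_error = 'skip_file';
```

Keep in mind that parts cut at arbitrary byte offsets can begin or end in the middle of a record, and with on_error = 'skip_file' any part that fails to parse is skipped entirely, so loading the unsplit file as shown earlier is probably the safer route.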

Solution 1
0 · Accepted · 2021-04-27 02:51:13
