
Adding only one of S3 partitioned files to AWS Glue

I'm having a slight issue running a crawler over my S3 buckets. My folders hold data that was dumped from Redshift and sliced into many separate files. The files are named as follows:

dump_0000_part_00.gz, dump_0001_part_01.gz ....

However, when my crawler fetches the metadata in this folder, it creates a few hundred tables, treating each of these sliced files as its own table. Is there a way to tell the crawler to group all these sliced files into ONE catalog table?

When configuring the crawler (or editing an existing one), under the Output section, expand Grouping behavior for S3 data (optional) and select Create a single schema for each S3 path.
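If you would rather set this outside the console, the same grouping option can, as far as I know, be expressed through the crawler's Configuration JSON using the CombineCompatibleSchemas table grouping policy. Below is a minimal sketch with boto3, not a definitive implementation. The crawler name is a hypothetical placeholder, and the update call assumes boto3 is installed and AWS credentials are configured.

```python
import json


def grouping_configuration() -> str:
    """Build the crawler Configuration JSON that asks Glue to combine
    compatible schemas under one S3 path into a single table."""
    return json.dumps(
        {
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }
    )


def apply_grouping(crawler_name: str) -> None:
    """Update an existing crawler with the grouping configuration.

    Requires boto3 and valid AWS credentials; the crawler must already exist.
    """
    import boto3  # imported here so the helper above works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name, Configuration=grouping_configuration())


if __name__ == "__main__":
    # Print the JSON so it can be inspected or pasted elsewhere.
    print(grouping_configuration())
    # apply_grouping("my-redshift-dump-crawler")  # hypothetical crawler name
```

After updating the crawler, it likely needs to be re-run, and tables created by earlier runs may have to be deleted by hand before the single combined table appears.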


 