AWS Glue Crawler: separate tables for folders in S3

Question

My S3 file structure is:

├── bucket
│   ├── customer_1
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├── sometype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   │   ├── month=12
│   │   │   │   ├── sometype-2017-12-01.parquet
│   │   │   │   ├── sometype-2017-12-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=2018
│   │   │   ├── month=01
│   │   │   │   ├── sometype-2018-01-01.parquet
│   │   │   │   ├── sometype-2018-01-02.parquet
│   │   │   │   ├── ...
│   ├── customer_2
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── moretype-2017-11-01.parquet
│   │   │   │   ├── moretype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=...

I want to create separate tables for customer_1 and customer_2 with an AWS Glue crawler. It works if I specify the paths s3://bucket/customer_1 and s3://bucket/customer_2 explicitly.

I've tried s3://bucket/customer_* and s3://bucket/*. Neither works, and no table is created in the Glue catalog.
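Since the wildcard paths did not work here, one workaround is to expand the pattern yourself and hand the crawler one explicit path per customer folder. The following is a minimal sketch, not a definitive implementation: the helper names and the "customer_" filter are hypothetical, and it assumes a boto3-style S3 client whose list_objects_v2 call returns CommonPrefixes when a delimiter is given. A single call returns a limited number of prefixes, so a bucket with many folders would need pagination.

```python
def list_top_level_prefixes(s3_client, bucket):
    """Return the top-level 'folders' of a bucket, e.g. ['customer_1/', 'customer_2/']."""
    # With Delimiter="/", S3 groups keys by their first path segment.
    response = s3_client.list_objects_v2(Bucket=bucket, Delimiter="/")
    return [entry["Prefix"] for entry in response.get("CommonPrefixes", [])]


def build_s3_targets(bucket, prefixes, wanted_start="customer_"):
    """Build one crawler S3 target per prefix that starts with wanted_start."""
    return [
        {"Path": f"s3://{bucket}/{prefix.rstrip('/')}"}
        for prefix in prefixes
        if prefix.startswith(wanted_start)
    ]


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# prefixes = list_top_level_prefixes(boto3.client("s3"), "bucket")
# targets = build_s3_targets("bucket", prefixes)
# The resulting list can be passed as Targets={"S3Targets": targets}.
```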

Answer 1

I faced this issue myself recently. AWS Glue crawlers have an option called Grouping behaviour for S3 data. If the checkbox is not selected, the crawler will try to combine schemas. By selecting the checkbox and setting a table level, you can ensure that multiple, separate tables are created.

The table level should be the depth, counted from the root of the bucket, at which you want the separate tables to start.

In your case the depth would be 2.
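The same setting can also be supplied when a crawler is defined through the API, via the Grouping / TableLevelConfiguration entry of the crawler's configuration JSON. The sketch below only builds the request and assumes boto3 for the final call. The crawler name, role ARN and database name are hypothetical placeholders, and build_crawler_request is an illustrative helper, not part of any library.

```python
import json


def build_crawler_request(name, role_arn, database, s3_path, table_level):
    """Build keyword arguments for a Glue create_crawler call."""
    configuration = {
        "Version": 1.0,
        # Level 1 is the bucket, so level 2 is the first folder below it.
        "Grouping": {"TableLevelConfiguration": table_level},
    }
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # The API expects the configuration as a JSON string.
        "Configuration": json.dumps(configuration),
    }


request = build_crawler_request(
    name="customers-crawler",  # hypothetical name
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    database="customers_db",  # hypothetical database
    s3_path="s3://bucket/",
    table_level=2,
)

# To create the crawler (requires boto3 and AWS credentials):
# import boto3
# boto3.client("glue").create_crawler(**request)
```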

More here

Answer 2

When pointed at the parent folder, Glue's natural tendency is to put similar schemas into the same table whenever they match by more than 70% (assuming that, in your case, customer_1 and customer_2 have the same schema). Keeping them in individual folders might then create partitions based on the folder names instead of separate tables.
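To make the difference concrete, the sketch below shows only the path arithmetic: which part of an object path becomes the table location and which folders are left over as partitions at a given table level. It is an illustration under stated assumptions, not Glue's actual algorithm. The schema similarity check and the 70% threshold are not modelled, and split_at_table_level is a hypothetical helper.

```python
def split_at_table_level(s3_uri, table_level):
    """Split an object URI into (table_location, partition_folders, file_name).

    Level 1 is the bucket itself, so level 2 is the first folder below it.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    *folders, file_name = s3_uri[len("s3://"):].split("/")
    if not 1 <= table_level <= len(folders):
        raise ValueError("table_level is outside the folder depth of this URI")
    table_location = "s3://" + "/".join(folders[:table_level])
    # Everything below the table location is treated as a partition folder.
    return table_location, folders[table_level:], file_name


uri = "s3://bucket/customer_1/year=2017/month=11/sometype-2017-11-01.parquet"

# Level 1: one table at the bucket root, customer_1 is just another partition folder.
merged = split_at_table_level(uri, 1)

# Level 2: the table starts at customer_1, leaving only year and month as partitions.
separate = split_at_table_level(uri, 2)
```

With the table at level 1, the customer folder has no key=value name, so it would show up as an unnamed partition column rather than as its own table.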

Answer 1
3 2021-10-06 14:43:23

Answer 2
2 2018-04-19 14:39:30