
AWS Glue Crawler: want separate table for folder in s3

My S3 file structure is:

├── bucket
│   ├── customer_1
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├── sometype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   │   ├── month=12
│   │   │   │   ├── sometype-2017-12-01.parquet
│   │   │   │   ├── sometype-2017-12-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=2018
│   │   │   ├── month=01
│   │   │   │   ├── sometype-2018-01-01.parquet
│   │   │   │   ├── sometype-2018-01-02.parquet
│   │   │   │   ├── ...
│   ├── customer_2
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── moretype-2017-11-01.parquet
│   │   │   │   ├── moretype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=...

I want to create a separate table for customer_1 and customer_2 with an AWS Glue crawler. It works if I specify the paths s3://bucket/customer_1 and s3://bucket/customer_2 individually.

I've tried s3://bucket/customer_* and s3://bucket/*. Neither works, and no table is created in the Glue catalog.

I faced this issue myself recently. AWS Glue crawlers have an option called "Grouping behaviour for S3 data". If the checkbox is not selected, the crawler will try to combine schemas. By selecting the checkbox you can ensure that multiple, separate tables are created.

The table level should be the depth, counted from the root of the bucket, at which you want separate tables.

In your case the depth would be 2.
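The same setting can also be supplied when the crawler is created through the API. Below is a minimal sketch with boto3, assuming the table level is passed as `TableLevelConfiguration` under `Grouping` in the crawler's `Configuration` JSON. The function names and the crawler name, role, and database arguments are hypothetical placeholders, not taken from the question.

```python
import json


def build_crawler_configuration(table_level: int) -> str:
    """Return a crawler Configuration JSON string with an explicit table level."""
    if table_level < 1:
        raise ValueError("table_level must be a positive integer")
    return json.dumps({
        "Version": 1.0,
        # Assumption: level 1 is the bucket, so level 2 is bucket/customer_N.
        "Grouping": {"TableLevelConfiguration": table_level},
    })


def create_customer_crawler(name: str, role_arn: str, database: str, s3_path: str) -> None:
    """Create a crawler on s3_path that should produce one table per level-2 folder."""
    import boto3  # imported lazily so the helper above works without boto3 installed

    glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,          # hypothetical crawler name
        Role=role_arn,      # IAM role the crawler runs as
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
        Configuration=build_crawler_configuration(2),
    )
```

Calling `create_customer_crawler("customers", role_arn, "my_db", "s3://bucket/")` with a valid role would then be expected to yield one table each for customer_1 and customer_2, with year and month as partitions.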


Glue's natural tendency, when pointed at the parent folder, is to add similar schemas to the same table whenever they match by more than 70% (assuming that, in your case, customer_1 and customer_2 have the same schema). Keeping them in individual folders might instead create respective partitions based on the folder names.
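Since the crawler path does not accept wildcards such as customer_*, another way to avoid the merge is the one the question already reports as working: list each customer folder as its own target. The sketch below generates those targets instead of typing them by hand. The helper names are hypothetical, and the folder discovery assumes the customer folders sit directly under the bucket root.

```python
def s3_targets_for(bucket: str, folders: list[str]) -> dict:
    """Build the Targets argument for a crawler, one S3 target per folder."""
    return {"S3Targets": [{"Path": f"s3://{bucket}/{folder}"} for folder in folders]}


def list_customer_folders(bucket: str, prefix: str = "customer_") -> list[str]:
    """List top-level folders in the bucket whose names start with prefix."""
    import boto3  # imported lazily so s3_targets_for works without boto3 installed

    s3 = boto3.client("s3")
    folders = []
    paginator = s3.get_paginator("list_objects_v2")
    # Delimiter="/" makes S3 return the folder-like prefixes as CommonPrefixes.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            folders.append(entry["Prefix"].rstrip("/"))
    return folders


# Explicit targets for the two folders named in the question.
targets = s3_targets_for("bucket", ["customer_1", "customer_2"])
```

The resulting `targets` dictionary can be passed as the `Targets` argument when creating or updating a crawler, and `list_customer_folders` can supply the folder list once more customers are added.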
