
AWS Glue Crawler issue

Original 2022-11-15 06:57:05 4 1 database/ amazon-web-services/ aws-glue

I have a requirement for an ETL process where raw data is loaded into an S3 bucket once a day (the zip may contain 30 to 50 individual files with different schemas). The data is new every day and may or may not have the same schema. I unzipped the data, loaded it into one S3 bucket, crawled the files, ran some jobs, and processed the data. The problem is that the next day, when new raw data is loaded and I crawl the newly updated folder again, the tables in the Glue catalog remain the same, with the same data reference.

What alternative do I have if the data changes daily and new tables should be created the next day? Or how can I read only the new data?

I tried to crawl the new folders with the same crawler and the same DB, using a different S3 folder.

1 answer

It seems the schema of the new raw files is the same as that of the files the crawler has already crawled. You will not see new tables created in this case. That's how the crawler works.

To confirm this, query the files in Athena by selecting the table that the crawler created; you should be able to see the data from all the files.
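The Athena check can also be scripted. Here is a minimal sketch, assuming a hypothetical database `raw_db`, table `daily_files`, and a results location of your own; none of these names come from the question. It relies on Athena's `"$path"` pseudo-column, which returns the S3 object each row was read from, so grouping by it shows which files the existing table already covers.

```python
def build_file_count_query(database: str, table: str) -> str:
    """Return SQL that counts rows per underlying S3 object."""
    return (
        'SELECT "$path" AS source_file, count(*) AS row_count '
        f'FROM "{database}"."{table}" '
        'GROUP BY "$path" ORDER BY source_file'
    )


def run_query(sql: str, output_location: str) -> str:
    """Submit the query to Athena and return the query execution id."""
    # Imported lazily so the query builder above works without AWS access.
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        # output_location is an S3 URI you own (an assumption here).
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
sql = build_file_count_query("raw_db", "daily_files")
```

If the files uploaded on the second day show up in the `source_file` column, the crawler has folded them into the existing table rather than creating a new one.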

New tables will only be created if the schema of these new files is different.

To understand how Crawler works, give this doc a go.
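For the "read only the new data" part of the question, one option (not something the answer above prescribes) is to land each day's files under a dated folder and let the crawler pick up only folders it has not seen. The sketch below only builds the S3 path and the keyword arguments you might pass to boto3's `glue.update_crawler()`; the bucket, dataset, and crawler names are hypothetical. As far as I know, `CRAWL_NEW_FOLDERS_ONLY` requires both schema change behaviors to be `LOG`, so treat that pairing as an assumption to verify against the Glue docs.

```python
from datetime import date


def daily_prefix(bucket: str, dataset: str, day: date) -> str:
    """S3 path for one day's files, using a Hive-style dt= partition folder."""
    return f"s3://{bucket}/{dataset}/dt={day.isoformat()}/"


def incremental_crawler_settings(name: str, bucket: str, dataset: str) -> dict:
    """Keyword arguments for glue.update_crawler() that limit recrawls to new folders."""
    return {
        "Name": name,
        # The crawler targets the dataset root; dated folders sit beneath it.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{dataset}/"}]},
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Assumption: incremental crawls need LOG for both behaviors.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    }


# Hypothetical names, for illustration only.
prefix = daily_prefix("my-raw-bucket", "orders", date(2022, 11, 16))
settings = incremental_crawler_settings("daily-raw-crawler", "my-raw-bucket", "orders")
```

With this layout, files that share a schema stay in one table and each day becomes a `dt` partition, so a job or Athena query can filter on `dt` to read only the latest load. This assumes each of the 30 to 50 file types gets its own dataset folder; files with different schemas mixed in one folder would still produce separate tables.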

update schedule of a glue crawler on aws

AWS Glue Crawler cannot parse large files (classification UNKNOWN)

AWS Glue Crawler - Crawl new folders only - Internal Service Exception

AWS Glue : How to make sure glue crawler always picks up the latest file from S3

Cast Issue with AWS Glue 3.0 - Pyspark

How can I exclude specific folders with a specific year for the crawler in AWS Glue?

AWS Athena Return Zero Records from Tables Created by GLUE Crawler input csv from S3

Glue crawler creating multiple tables

Issue developing AWS Glue ETL jobs locally using a Docker container

Step function hanging on glue crawler step





solution1 1 2022-11-16 09:14:00
