
AWS Glue Crawler issue

I have a requirement for an ETL process in which raw data is loaded into an S3 bucket once a day (a zip that may contain 30 to 50 individual files with different schemas). The data is new every day and may or may not have the same schema. I unzipped the data, loaded it into one S3 bucket, crawled the files, and ran some jobs to process the data. The problem is that the next day, when new raw data is loaded and I crawl the newly updated folder again, the tables in the Glue catalog remain the same, with the same data reference.

What alternative do I have if the data changes daily and new tables should be created the next day? Or how can I read only the new data?

I tried crawling the new folders with the same crawler and the same database, pointing it at a different S3 folder.

It seems the schema of the new raw files is the same as that of the files the crawler has already crawled. In that case you will not see any new tables created. That is how the crawler works.

To confirm this, query the files with Athena by selecting the table that the crawler created; you should be able to see all the data from all the files.
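As a rough sketch of that check, the query below groups rows by Athena's `"$path"` pseudo-column, which holds the S3 object each row came from, so you can see whether the new day's files are already part of the existing table. The database name `raw_db`, the table name `daily_files`, and the results location are hypothetical placeholders, not names from the question.

```python
def build_per_file_count_query(database: str, table: str) -> str:
    """Return an Athena SQL statement that counts rows per source S3 object."""
    # "$path" is an Athena pseudo-column with the S3 location of each row's file.
    return (
        'SELECT "$path" AS source_file, COUNT(*) AS row_count '
        f'FROM "{database}"."{table}" '
        'GROUP BY "$path" ORDER BY source_file'
    )


def start_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported lazily so the query builder works without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # output_location is an S3 URI where Athena writes results (assumed to exist).
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical names; replace with your own database and table.
    print(build_per_file_count_query("raw_db", "daily_files"))
```

If the files uploaded today appear in the `source_file` column, the crawler has folded them into the existing table rather than creating a new one.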

New tables will only be created if the schema of these new files is different.
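If the real goal is to read only the newest data rather than to get new tables, one common pattern (an assumption here, not something stated in the answer) is to land each day's files under a Hive-style `dt=YYYY-MM-DD` folder per dataset. A crawler typically registers such folders as partitions of one table, and a query can then filter on the partition. The sketch below only builds the key prefix and the SQL; the prefix `raw`, the dataset name `orders`, and the column name `dt` are hypothetical.

```python
from datetime import date


def daily_prefix(base_prefix: str, dataset: str, day: date) -> str:
    """Build a Hive-style key prefix such as raw/orders/dt=2024-01-15/."""
    return f"{base_prefix.strip('/')}/{dataset}/dt={day.isoformat()}/"


def latest_day_query(database: str, table: str, day: date) -> str:
    """Return SQL that reads only the rows from one day's partition."""
    # Assumes the crawler registered dt as a string partition column.
    return (
        f'SELECT * FROM "{database}"."{table}" '
        f"WHERE dt = '{day.isoformat()}'"
    )


if __name__ == "__main__":
    today = date(2024, 1, 15)
    print(daily_prefix("raw/", "orders", today))
    print(latest_day_query("raw_db", "orders", today))
```

Files with different schemas would each need their own dataset folder under this layout, so that every table keeps a single schema across days.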

To understand how Crawler works, give this doc a go.


 