
AWS Glue Crawler issue

I have a requirement for an ETL process where raw data is loaded into an S3 bucket once a day (each zip may contain 30 to 50 individual files with different schemas). The data is new every day and may or may not have the same schema. I unzipped the data, loaded it into one S3 bucket, crawled the files, and ran some jobs to process the data. The problem is that the next day, when new raw data is loaded and I crawl the newly updated folder again, the tables in the Glue catalog remain the same, with the same data reference.

What alternate option do I have if the data changes daily and new tables should be created the next day? Or how can I read only the new data?

I tried to crawl the new folders with the same crawler and the same database, but a different S3 folder.

It seems that the schema of the new raw files is the same as that of the files the crawler has already crawled. In that case you will not see a new table created; that is how the crawler works.
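For example, a minimal boto3 sketch (the crawler and database names below are placeholders, not taken from the question) that re-runs the crawler and then lists the catalog tables, so you can see whether the crawl created new tables or only refreshed the existing ones:

```python
import time
import boto3

# Hypothetical names -- substitute your own crawler and database.
CRAWLER_NAME = "raw_data_crawler"
DATABASE_NAME = "raw_data_db"

glue = boto3.client("glue")

# Re-run the crawler over the updated S3 folder.
glue.start_crawler(Name=CRAWLER_NAME)

# Wait for the crawl to finish.
while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
    time.sleep(30)

# List the tables in the catalog. If the new files share the schema of the
# old ones, you will see the same table names as before rather than new ones.
for table in glue.get_tables(DatabaseName=DATABASE_NAME)["TableList"]:
    print(table["Name"], table.get("UpdateTime"))
```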

To confirm this, query the files using Athena, selecting the table the crawler created; you should be able to see the data from all of the files.
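A rough sketch of that check using boto3 and Athena (the table, database, and output-bucket names are assumptions, not taken from the question):

```python
import time
import boto3

athena = boto3.client("athena")

# Placeholder names and result location -- adjust to your environment.
query = athena.start_query_execution(
    QueryString="SELECT * FROM raw_data_table LIMIT 100",
    QueryExecutionContext={"Database": "raw_data_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# If rows from both the old and the new files come back, the crawler merged
# them into the same table instead of creating a new one.
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```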

New tables will only be created if the schema of the new files is different.
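If you want to see why no new table appeared, you can inspect the column definitions the crawler recorded for the existing table; a small sketch, again with placeholder names:

```python
import boto3

glue = boto3.client("glue")

# Placeholder database and table names.
table = glue.get_table(DatabaseName="raw_data_db", Name="raw_data_table")["Table"]

# The crawler stores the inferred schema in the table's StorageDescriptor;
# new files that match these columns are folded into this same table.
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```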

To understand how the crawler works, give this doc a go.
